Objective
Gathering Data
Prepare Data for Modeling
- Supervised/Unsupervised/Regression/Classification
- Data Visualization
- Data Cleaning: Missing Values, Outlier Removal
- Feature Extraction: Interaction, Lagging, One-Hot Encoding
- Feature Selection: Univariate/Recursive
- Split data into train/test
- Under Sampling, Over Sampling
- Scaling, Normalizing, Outlier Removal
Baseline Modeling
- Select model
- Linear Regression, Logistic Regression
- K-nearest neighbors, Decision Tree
- Support Vector Machine, Random Forest
- Parametric Models: Multicollinearity, Relation with the outcome
- Evaluation Metric: ROC AUC, PR AUC, Accuracy, R2, MSE, RMSE, RSS, MAE
- Fit model on train-set
- Test model on test-set
- Feature importances, ANOVA table (statsmodels), Coefficients
Secondary Modeling
- Reduce Overfitting
- Hyperparameter Tuning: L1, L2 penalty
- Gradient Boosting, XGBoost, Custom Ensembles
- Improve generalization error
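The hyperparameter-tuning step in the outline above can be sketched with scikit-learn's GridSearchCV. This is a minimal illustration on hypothetical toy data, not the project's actual pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# hypothetical toy data standing in for a prepared feature matrix
rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = (X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.randn(200) > 0).astype(int)

# grid-search the penalty type (L1 vs L2) and its inverse strength C with 5-fold CV
param_grid = {"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1.0]}
search = GridSearchCV(LogisticRegression(solver="liblinear"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The liblinear solver is used because it supports both L1 and L2 penalties.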
import pandas as pd
import pickle
from geopy.distance import great_circle
import operator
from scipy import sparse
from scipy.sparse import csr_matrix, vstack, hstack, coo_matrix
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
%matplotlib inline
Predict whether a restaurant will close based on its geographical location and user reviews. Most businesses don't open and close within a short period of time; they stay in business for years before closing. The newest Yelp dataset was released six months ago and contains data about businesses that have opened and closed since 2004. Thus, it will be useful for predicting whether a restaurant will close in the next few years. Banks can use this model to make informed loan decisions, because loans are mostly long-term, and investors want to put their money into a restaurant that is likely to remain open for many years.
I searched for related work that tries to predict restaurant closures. The project I am aiming to develop is unique in the following aspects:
It can better generalize across cities because it is trained on 11 metropolitan areas, whereas the related projects focused on a single city.
It aims at improving performance by combining bag-of-words features with other features, such as the performance of surrounding businesses. Researchers at the University of Maryland worked on textual features only, and Michail Alifierakis used only features related to the surrounding businesses; he didn't incorporate textual features.
It uses the most recent version of the dataset, which was released six months ago.
This dataset is a subset of Yelp's businesses, reviews, and user data. In total, there are:
It is a huge dataset. I believe that more data contributes to improving model generalization. Loading the data into memory at once is impossible, so I considered the following options.
Pandas can read in chunks. I utilized Google Cloud Platform to create a virtual machine with 8 CPUs and 56 GB RAM. The project involved trying models on sets of features, grid searching across pipelines, and processing textual data. The cloud compute engine really helped me see results quickly.
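The chunked-reading pattern can be sketched as follows; this is a minimal illustration on an in-memory JSON-lines sample (a stand-in for the multi-gigabyte review file), aggregating per chunk so the full file never sits in memory:

```python
import io
import pandas as pd

# hypothetical small JSON-lines sample standing in for the multi-GB review file
sample = io.StringIO('{"stars": 5}\n{"stars": 3}\n{"stars": 4}\n{"stars": 1}\n')

total_rows = 0
star_sum = 0
# with lines=True and chunksize, read_json returns an iterator of DataFrames
for chunk in pd.read_json(sample, lines=True, chunksize=2):
    total_rows += len(chunk)
    star_sum += chunk["stars"].sum()

print(total_rows, star_sum / total_rows)
```

Only per-chunk aggregates are kept, which is the same pattern used in the review-counting cells below.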
b = pd.read_json("yelp_academic_dataset_business.json", lines=True)
Contains business data including location data, attributes, and categories.
b.head(6)
See Categories more closely
b.categories.sample(5)
The Yelp Category List
Each one of these contains other, more specific categories
Restaurants
Attributes are stored as a nested JSON object.
b.attributes[1005]
b.columns
All the attributes are displayed as follows. The star rating is the average of all stars received by the business, rounded to half-stars.
Checkins on a business.
csize = 1e2
checkin_reader = pd.read_json("yelp_academic_dataset_checkin.json", lines=True, chunksize=csize)
for i, ch in enumerate(checkin_reader):
if(i > 0):
break
ch.columns
ch.head()
Contains full review text data including the user_id that wrote the review and the business_id the review is written for.
The reviews data is more than 5 GB, and uploading it directly took time. I decided to upload the data to an FTP server and transfer it from there. Python's built-in ftplib library can read data from an FTP server.
# # import necessary libraries
# from ftplib import FTP
# import requests
# # login to ftp server
# server = "70.40.217.80"
# username = "tinsaecares@tinsaealemayehu.com"
# password = "[RQ-K$%[1m;N"
# ftp = FTP(server)
# ftp.login(user=username, passwd=password)
# rvfile = open("yelp_academic_dataset_tip.json", "wb")
# ftp.retrbinary("RETR yelp_academic_dataset_tip.json", rvfile.write)
# rvfile.close()
# # Downloading directly from yelp to computer
# !wget -c "https://storage.googleapis.com/kaggle-datasets/10100/277695/yelp-dataset.zip?GoogleAccessId=web-data@kaggle-161607.iam.gserviceaccount.com&Expires=1551046405&Signature=p%2BBda9sOLceu8EceqHAhrOPKgraMDDfZ8%2FqiW1z8DygzgoJaUzBW5xL8MWtD5z55OVJWe61Zv9qpdVeHCXcZ0r4mpV7dezhwA2Mmus2OyJqAhhWoXxPSieNB36fHaFmrm6xu%2FOEp5R1TJ2s72vWKU7tnOJS9%2BFBYnvOuV6cPN4lrpVVTrixOjCLv8QQWndfGR7V1Wcz2MqLaqAVFWuN94WU6ycz%2BlcSvdgqPplNHmmzcT%2BBvepxcuTheWMPuHusGUPJ1FHHQ4BEv8RgrhtTvXtV6jxupyw69eMHWpdN0J4bWHACc2%2BGthxgK00Tplt%2FrINAFiPATMf7dpciyNKigAA%3D%3D" -O yelp.zip
csize = 1e2
review_reader = pd.read_json("yelp_academic_dataset_review.json", lines=True, chunksize=csize)
for i, r in enumerate(review_reader):
if(i > 0):
break
r.columns
r.head()
See the text, stars, date and user. A review can receive votes from other users: useful, cool, funny. The business that is reviewed and the user who wrote the review are linked by business_id and user_id respectively.
Tips written by a user on a business. Tips are shorter than reviews and tend to convey quick suggestions.
csize=1e2
tip_reader = pd.read_json("yelp_academic_dataset_tip.json", lines=True, chunksize=csize)
for i, t in enumerate(tip_reader):
if(i > 0):
break
t.columns
t.head()
User data including the user's friend mapping and all the metadata associated with the user.
user_reader = pd.read_json("yelp_academic_dataset_user.json", lines=True, chunksize=csize)
csize=1e2
for i, u in enumerate(user_reader):
if(i > 0):
break
u.columns
u.head()
useful, funny, and cool also exist in the user table, where they refer to the number of votes sent by the user. In the review table these features counted the number of votes given to a review.
Users can receive compliments from other users. A compliment is given on the profile page; it is not given for a single review, but rather for the overall activity of a user.
| Compliment Type | Profile |
|---|---|
# the code below is just practice code
# from scipy.sparse import csr_matrix, vstack, hstack, coo_matrix
# from IPython.display import HTML, display
# import tabulate
# list1 = np.array([[1,0,0], [4,5,0]])
# list1 = [[1,0,0], [4,5,0]]
# matrix1 = csr_matrix(list1)
# display(HTML(tabulate.tabulate(list1, tablefmt='html')))
# print(matrix1)
# print()
# # stacking a new row
# matrix2 = vstack((matrix1, [0,9,0]))
# list2 = matrix2.toarray()
# display(HTML(tabulate.tabulate(list2, tablefmt='html')))
# print(matrix2)
# print()
# # stacking a new column on matrix1
# matrix3 = hstack((matrix1,[[0],[4.5]] ))
# list3 = matrix3.toarray()
# display(HTML(tabulate.tabulate(list3, tablefmt='html')))
# print(matrix3)
# print()
# # stacking a two columns on matrix1
# matrix4 = hstack((matrix1,[[0,9.9589],[4.5214, 0]] ))
# list4 = matrix4.toarray()
# display(HTML(tabulate.tabulate(list4, tablefmt='html')))
# print(matrix4)
# print()
# # rounding numpy matrices
# matrix4.data= np.round(matrix4.data,2)
# print(matrix4)
# nulls in percent
((b.isnull().sum()/len(b)) * 100).round(5)
nulls_count = pd.DataFrame({"open":(b[b.is_open == True].isnull().sum()/len(b[b.is_open == True])).round(5),
                            "closed":(b[b.is_open == False].isnull().sum()/len(b[b.is_open == False])).round(5)})
nulls_count = nulls_count[(nulls_count.open > 0) | (nulls_count.closed > 0)]
nulls_count["feature"] = nulls_count.index
nulls_count_melted = pd.melt(nulls_count, id_vars=['feature'], value_vars=['open', 'closed'])
sns.pointplot(x="feature", y="value", hue="variable", kind="point", data=nulls_count_melted)
sns.despine()
plt.show()
No, closed businesses have far fewer nulls
sns.set(rc={'figure.figsize':(8,6)})
sns.countplot(b.is_open)
plt.title("Imbalanced Classes")
plt.xticks([0,1], ("Closed", "Open"))
plt.xlabel("Class")
plt.ylabel("Frequency")
sns.despine()
plt.show()
pd.value_counts(b.is_open)/b.shape[0]
23% of the hours values are missing. I can drop hours
0.29% of rows have a missing category. I can drop those rows
0.001% of rows have missing latitude/longitude information. Dropping them will not affect the data
I am not going to use the attributes feature.
b.drop("hours", axis=1, inplace=True)
b = b.drop(b[b.categories.isnull()].index)
b = b.drop(b[(b.latitude.isnull()) | (b.longitude.isnull()) ].index)
b = b.drop("attributes", axis=1)
# nulls in percent
((b.isnull().sum()/len(b)) * 100).round(5)
A business can be labeled with more than one category. Converting them to binary columns will make filtering and analyzing very easy. Let me try to one-hot encode them.
b.categories.head()
Splitting by comma works except for businesses that have a single category whose name itself contains commas.
Since there is no syntactic difference between a single category containing commas and multiple categories separated by commas, I couldn't find a regex to deal with this problem. Replacing manually is the remaining option.
b.loc[:, 'categories'] = b.categories.str.replace("Books, Mags, Music & Video", "Books+Mags+ & MusicVideo")
b.loc[:, 'categories'] = b.categories.str.replace("Used, Vintage & Consignment", "Used+Vintage+ & Consignment")
b.loc[:, 'categories'] = b.categories.str.replace("Beer, Wine & Spirits", "Beer+Wine & Spirits")
b.loc[:, 'categories'] = b.categories.str.replace("Wills, Trusts, & Probates", "Wills+Trusts+&Probates")
One-hot encoding is for single-label categorical variables. For multiple labels, scikit-learn's MultiLabelBinarizer is the way to go.
# split by comma
categories_series = b.categories.apply(lambda x: x.split(","))
# a lambda function to convert the following data structure
# [list(['Tours', ' Breweries', ' Pizza', ' Restaurants', ' Food', ' Hotels & Travel'])
# to
# ['Tours', ' Breweries', ' Pizza', ' Restaurants', ' Food', ' Hotels & Travel']
take_out_list = lambda x: list(x)
# get 2D list
categories_2D_list = take_out_list(categories_series)
# remove leading and trailing white spaces
for alist in categories_2D_list:
alist[:] = [category.strip() for category in alist]
#print(categories_2D_list[:3])
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
categories_binarized = mlb.fit_transform(categories_2D_list)
#print(categories_binarized.shape)
#print(mlb.classes_.shape)
# mlb.classes_ returns the unique categories found by the binarizer
categories_binarized_df = pd.DataFrame(categories_binarized, columns=list(mlb.classes_))
categories_binarized_df.sample(10)
mlb.classes_
top_20_categories = categories_binarized_df.sum().sort_values(ascending=False)[:20]
sns.set(rc={'figure.figsize':(11,8)})
ax = sns.barplot(y=top_20_categories.index, x=top_20_categories.values, orient='h')
sns.despine()
# find the number of total busineses
total_business = categories_binarized_df.shape[0]
combinations_of_top20 = categories_binarized_df[top_20_categories.index].apply(lambda x: list(x.index[np.where(x != 0)]), axis=1)
# convert index to string to make it hashable
import json
# dumps converts list to string
combinations_of_top20 = combinations_of_top20.map(json.dumps)
value_counts_top20 = combinations_of_top20.value_counts()
value_counts_top20
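The json.dumps round-trip above is needed because Python lists are unhashable, so value_counts cannot count them directly; serializing each combination to a JSON string makes it hashable. A minimal sketch with made-up categories:

```python
import json
import pandas as pd

combos = pd.Series([["Food", "Pizza"], ["Food"], ["Food", "Pizza"]])
# value_counts() on raw lists raises TypeError (lists are unhashable);
# mapping through json.dumps turns each combination into a hashable string
counts = combos.map(json.dumps).value_counts()
print(counts)
# json.loads recovers the original list when needed
print(json.loads(counts.index[0]))
```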
count_g_2 = 0
for c in value_counts_top20.index:
# count combinations of more than two categories
# loads converts back to list
if(len(json.loads(c)) > 2):
count_g_2 += value_counts_top20.loc[c]
print(count_g_2)
print(np.round((count_g_2/total_business) * 100, 2), "%")
Only 14% of all businesses have combined categories from the top-20 list. This indicates that most of these are top-level categories encompassing many businesses.
total_business
There are 1888045 businesses in the dataset.
top_20_categories.index
# concatenate is_open info to the binaried categories data
open_close_df = pd.concat([categories_binarized_df[top_20_categories.index], b.is_open], axis=1)
# find all open businesses
open_df = open_close_df[open_close_df.is_open ==1]
# find all closed businesses
close_df = open_close_df[open_close_df.is_open ==0]
# sum across columns to find number of open business in each column(category)
open_close_count = (pd.concat([open_df.sum(), close_df.sum()], axis=1))
# change the default column names
open_close_count.columns = ["Open", "Closed"]
#print(open_close_count)
# divide the counts by the total businesses in each category
open_close_count_normalized = open_close_count.div(open_df.sum() + close_df.sum(), axis=0).sort_values(by="Closed", ascending=False)[:-2]
sns.set(rc={'figure.figsize':(8,8)})
ax = sns.pointplot(x=open_close_count_normalized.Closed, y=open_close_count_normalized.index, orient='h')
sns.despine()
Businesses at the end of the top-20 list appear at the top of this graph, and vice versa. Fast food and Coffee & Tea have the highest closure rates, though they are fewer in number than the other businesses. Shopping and Health & Medical businesses exist in large numbers but are the least likely to close.
categories_binarized_df["business_id"] = b.business_id.values
categories_binarized_df["is_open"] = b.is_open.values
restaurants_df = categories_binarized_df[categories_binarized_df.Restaurants == 1]
shopping_df = categories_binarized_df[categories_binarized_df.Shopping == 1]
total_reviews = 5996996
# # the code below takes time. The results are saved to file and loaded in the next cell
# csize = 1e6
# review_reader = pd.read_json("yelp_academic_dataset_review.json", lines=True, chunksize=csize)
# num_restaurant_reviews = 0
# num_shopping_reviews = 0
# total_reviews = 0
# for i, r in enumerate(review_reader):
# total_reviews += r.shape[0]
# print("working on chunk ",i," shape =", r.shape[0])
# # count number of restaurant reviews found in chunk r
# num_restaurant_reviews += pd.merge(restaurants_df, r, on=["business_id"]).shape[0]
# # count number of shopping reviews found in chunk r
# num_shopping_reviews += pd.merge(shopping_df, r, on=["business_id"]).shape[0]
# pickle.dump(num_restaurant_reviews, open("yelp-dataset/num_restaurant_reviews.sav", 'wb'))
# pickle.dump(num_shopping_reviews,open("yelp-dataset/num_shopping_reviews.sav", 'wb'))
num_restaurant_reviews = pickle.load(open("yelp-dataset/num_restaurant_reviews.sav", 'rb'))
num_shopping_reviews = pickle.load(open("yelp-dataset/num_shopping_reviews.sav", 'rb'))
print("Percent of Restaurant Businesses: {:.2f}".format(restaurants_df.shape[0] / total_business))
print("Percent of Shopping Businesses: {:.2f}".format(shopping_df.shape[0] / total_business))
print()
print("Percent of Restaurant Reviews: {:.2f}".format(num_restaurant_reviews / total_reviews))
print("Percent of Shopping Reviews: {:.2f}".format(num_shopping_reviews / total_reviews))
There are a lot of restaurants on Yelp; shopping businesses are the second most common. Looking at the number of reviews, restaurants take a larger share than their count alone would suggest. Thus, restaurants are highly reviewed.
#The following code takes time. Saved data is loaded in the next cell
# csize=1e6
# review_reader = pd.read_json("yelp_academic_dataset_review.json", lines=True, chunksize=csize)
# total_reviews = 0
# restaurant_open_review_years = []
# restaurant_closed_review_years = []
# shopping_open_review_years = []
# shopping_closed_review_years = []
# for i, r in enumerate(review_reader):
# total_reviews += r.shape[0]
# print("working on chunk ",i," shape =", r.shape[0])
# restaurant_reviewsR = pd.merge(restaurants_df, r, on=["business_id"])
# restaurant_reviews_closed = restaurant_reviewsR[restaurant_reviewsR.is_open == 0]
# restaurant_reviews_open = restaurant_reviewsR[restaurant_reviewsR.is_open == 1]
# restaurant_closed_review_years.append(pd.to_datetime(restaurant_reviews_closed.date.dt.year).values)
# restaurant_open_review_years.append(pd.to_datetime(restaurant_reviews_open.date.dt.year).values)
# shopping_restaurantR= pd.merge(shopping_df, r, on=["business_id"])
# shopping_reviews_closed = shopping_restaurantR[shopping_restaurantR.is_open == 0]
# shopping_reviews_open = shopping_restaurantR[shopping_restaurantR.is_open == 1]
# shopping_closed_review_years.append(pd.to_datetime(shopping_reviews_closed.date.dt.year).values)
# shopping_open_review_years.append(pd.to_datetime(shopping_reviews_open.date.dt.year).values)
# # save the lists to file
# pickle.dump(restaurant_open_review_years, open("yelp-dataset/restaurant_open_review_years.sav", 'wb'))
# pickle.dump(restaurant_closed_review_years, open("yelp-dataset/restaurant_closed_review_years.sav", 'wb'))
# pickle.dump(shopping_open_review_years, open("yelp-dataset/shopping_open_review_years.sav",'wb'))
# pickle.dump(shopping_closed_review_years, open("yelp-dataset/shopping_closed_review_years.sav", 'wb'))
# load lists from file
restaurant_open_review_years = pickle.load(open("yelp-dataset/restaurant_open_review_years.sav", 'rb'))
restaurant_closed_review_years = pickle.load(open("yelp-dataset/restaurant_closed_review_years.sav", 'rb'))
shopping_open_review_years = pickle.load(open("yelp-dataset/shopping_open_review_years.sav", 'rb'))
shopping_closed_review_years = pickle.load(open("yelp-dataset/shopping_closed_review_years.sav", 'rb'))
restaurant_open_review_years = np.concatenate(restaurant_open_review_years).ravel()
restaurant_closed_review_years = np.concatenate(restaurant_closed_review_years).ravel()
shopping_open_review_years = np.concatenate(shopping_open_review_years).ravel()
shopping_closed_review_years = np.concatenate(shopping_closed_review_years).ravel()
fig, (ax1,ax2) = plt.subplots(2,1,figsize=(16,6), sharex= True, sharey=False)
sns.set(rc={'figure.figsize':(4,4)})
sns.distplot(restaurant_open_review_years.astype(int), kde=False, label="open", ax=ax1)
sns.distplot(restaurant_closed_review_years.astype(int), kde=False, label = "closed", ax=ax1)
ax1.legend()
ax1.set_ylabel("Count")
ax2.set_xlabel("Year")
ax1.set_title("Restaurants Review")
sns.distplot(shopping_open_review_years.astype(int), kde=False, label="open", ax=ax2)
sns.distplot(shopping_closed_review_years.astype(int), kde=False, label = "closed", ax=ax2)
ax2.set_title("Shopping Review")
ax2.set_xlabel("Year")
ax1.set_ylabel("Count")
sns.despine()
ax2.legend()
plt.show()
Yelp does not provide the date on which a restaurant closed. We can only approximate the rate of closure from the distribution of reviews. The graph shows that restaurants started receiving reviews on Yelp in 2004, and reviews have continued to increase. The year 2017 is the peak year for open restaurants. Many restaurants that were reviewed in 2014/15 are closed. We cannot be sure whether they closed in the same year or later.
Many shopping businesses that were reviewed in 2014-15 are closed by now.
#The following code takes 15 minutes. Saved data is loaded in the next cell
# csize = 1e5
# review_reader = pd.read_json("yelp_academic_dataset_review.json", lines=True, chunksize=csize)
# total_reviews = 0
# #restaurant_open_last_review_year = []
# #hopping_open_last_review_year =
# #hopping_closed_last_review_year = []
# for i, r in enumerate(review_reader):
# with warnings.catch_warnings():
# warnings.simplefilter("ignore")
# print("working on chunk ",i," shape =", r.shape[0])
# restaurant_reviewsR = pd.merge(restaurants_df, r, on=["business_id"])
# restaurant_reviewsR["review_year"] = pd.to_datetime(restaurant_reviewsR.date.dt.year)
# if(i == 0):
# restaurants_open_year = restaurant_reviewsR[["business_id", "review_year"]].groupby("business_id")["review_year"].min()
# restaurants_closed_year = restaurant_reviewsR[["business_id", "review_year"]].groupby("business_id")["review_year"].max()
# else:
# restaurants_open_year = pd.concat([restaurants_open_year, restaurant_reviewsR[["business_id", "review_year"]].groupby("business_id")["review_year"].min()])
# restaurants_closed_year = pd.concat([restaurants_closed_year, restaurant_reviewsR[["business_id", "review_year"]].groupby("business_id")["review_year"].max()])
# # save the dataframes to file
# pickle.dump(restaurants_open_year, open("yelp-dataset/restaurants_open_year.sav", 'wb'))
# pickle.dump(restaurants_closed_year, open("yelp-dataset/restaurants_closed_year.sav", 'wb'))
# open and close in the file names refer to earliest review year and most recent review years
# it is a naming mistake made when dumping the objects
earliest_review_year_all = pickle.load(open("yelp-dataset/restaurants_open_year.sav", 'rb'))
most_recent_review_year_all = pickle.load(open("yelp-dataset/restaurants_closed_year.sav", 'rb'))
earliest_review_year = earliest_review_year_all.groupby("business_id").min()
most_recent_review_year = most_recent_review_year_all.groupby("business_id").max()
restaurants_dates = pd.concat([earliest_review_year, most_recent_review_year], axis=1)
restaurants_dates.columns = ["earliest_review_year", "most_recent_review_year"]
# filter required attributes
restaurants_only = pd.merge(b[["business_id", "name", "state", "latitude", "longitude", "stars","review_count", "is_open"]],
restaurants_dates, on="business_id")
restaurants_only.head()
fig, (ax1,ax2) = plt.subplots(2,1,figsize=(12,8), sharex=True)
fig.subplots_adjust(hspace = 0.4, wspace = 0.4)
sns.set(rc={'figure.figsize':(4,4)})
open_earliest_years = restaurants_only[restaurants_only.is_open == True].earliest_review_year.values.astype(int)
open_most_recent_years = restaurants_only[restaurants_only.is_open == True].most_recent_review_year.values.astype(int)
closed_earliest_years = restaurants_only[restaurants_only.is_open == False].earliest_review_year.values.astype(int)
closed_most_recent_years = restaurants_only[restaurants_only.is_open == False].most_recent_review_year.values.astype(int)
sns.distplot(open_earliest_years, kde=False, label="newly opened", ax=ax1)
sns.distplot(open_most_recent_years, kde=False, label = "most recent", ax=ax1)
sns.distplot(closed_earliest_years, kde=False, label="newly opened", ax=ax2)
sns.distplot(closed_most_recent_years, kde=False, label = "out of business", ax=ax2)
ax1.set_title("Open Restaurants")
ax2.set_title("Closed Restaurants")
ax1.legend(loc="upper left")
ax2.legend(loc="upper left")
ax1.set_ylabel("Count")
ax2.set_ylabel("Count")
ax2.set_xlabel("Year")
plt.suptitle("Review Year")
sns.despine()
We approximate the earliest review year as the opening year of a restaurant. For currently closed restaurants, the most recent review year is approximated as the out-of-business year.
Looking at the upper graph, almost all currently open restaurants were reviewed within the last two years. The distribution of earliest review years is left-skewed; since 2010, a similar number of restaurants have opened each year.
Looking at the bottom graph, the number of newly opened restaurants is normally distributed for closed restaurants. Many restaurants that opened in 2010 are closed by now, while recently opened restaurants are not. This indicates that new restaurants are not very likely to close right away; closure takes some years. The difference between the number of new restaurants (red) and closed restaurants (blue) kept increasing until 2013 and then decreased.
There are many contributing factors that are outside the scope of this analysis.
len(set(restaurants_only.business_id))
There are 57173 restaurants on Yelp
# subtract earliest review year from 2018
restaurants_only["age"] = 2018 - restaurants_only.earliest_review_year.values.astype(int)
# select features
restaurants_only = restaurants_only[['business_id', 'name', "state", 'latitude', 'longitude', 'stars', 'review_count','age', 'is_open']]
restaurants_only.age.describe()
The maximum age is only 14 years because Yelp started in 2004
sns.set(rc={'figure.figsize':(8,6)})
sns.countplot(restaurants_only.is_open)
plt.title("Imbalanced Classes")
plt.xticks([0,1], ("Closed", "Open"))
plt.xlabel("Class")
plt.ylabel("Frequency")
sns.despine()
plt.show()
pd.value_counts(restaurants_only.is_open)/restaurants_only.shape[0]
pd.value_counts(restaurants_only.is_open)
Open and closed restaurants are not as imbalanced as businesses overall: 83% of all businesses are open, but only 72% of restaurants are.
restaurants_only.head()
fig, (ax1,ax2) = plt.subplots(1,2,figsize=(8,6), sharey=True)
# making spaces between subplots
#fig.subplots_adjust(hspace = 0.2, wspace = 0.2)
ax1.set_xlabel("Num of Reviews")
ax2.set_xlabel("Num of Reviews")
ax1.set_ylabel("Count")
ax2.set_ylabel("Count")
sns.kdeplot(restaurants_only.review_count.sort_values(), ax=ax1)
sns.kdeplot(restaurants_only.review_count.sort_values(), ax=ax2)
ax2.set_xlim(0, 2000)
plt.show()
sns.boxplot(restaurants_only.review_count.sort_values())
plt.show()
from scipy import stats
z = np.abs(stats.zscore(restaurants_only.review_count))
threshold = 3
outlier_businesses = list(np.where(z > 3)[0])
print("Number of restaurants: ", restaurants_only.shape[0])
print("Number of outliers: ", len(outlier_businesses))
Removing 786 businesses will do no harm because there is plenty of data.
business_ids = restaurants_only.iloc[outlier_businesses, 0].to_frame()
pd.merge(business_ids, b, on=["business_id"])["is_open"].value_counts()
print(round(45/(741 + 45) * 100), "%")
6% of the outliers are closed. Compared to the overall 27% closure rate, removing them will not harm the data.
pd.merge(business_ids, b, on=["business_id"])[["name", "state", "review_count", "address"]].sample(10)
These must be very popular places because they are highly reviewed. They may create problems during text processing because a lot of words will be used in their reviews.
# check shape of dataframe before removal
restaurants_only.shape
# remove outlier businesses
restaurants_only = restaurants_only.drop(outlier_businesses)
# check shape again
restaurants_only.shape
# Look 786 indices are missing from the data
pd.RangeIndex(stop=57173).difference(restaurants_only.index).shape
restaurants_only = restaurants_only.reset_index(drop=True)
# now index is reset correctly
pd.RangeIndex(stop=57173 - 786).difference(restaurants_only.index).shape
restaurants_only.is_open.value_counts()
# !python -m spacy download en
# import nltk
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')
# nltk.download('wordnet')
def enable_plotly_in_cell():
import IPython
from plotly.offline import init_notebook_mode
display(IPython.core.display.HTML('''
<script src="/static/components/requirejs/require.js"></script>
'''))
init_notebook_mode(connected=False)
from plotly.offline import iplot
import plotly.graph_objs as go
enable_plotly_in_cell()
data = [
go.Scattermapbox(lat=restaurants_only.latitude, lon=restaurants_only.longitude,
mode='markers',
text=restaurants_only.name)
]
layout = go.Layout(
autosize=True,
hovermode='closest',
mapbox=dict(
accesstoken="pk.eyJ1IjoicHJpeWF0aGFyc2FuIiwiYSI6ImNqbGRyMGQ5YTBhcmkzcXF6YWZldnVvZXoifQ.sN7gyyHTIq1BSfHQRBZdHA",
bearing=0,
center=dict(
lat=38.92,
lon=-77.07
),
pitch=0,
zoom=1
),
width=900,
height=600,
title = "World"
)
fig = dict(data=data, layout=layout)
iplot(fig, filename='Multiple Mapbox')
The restaurants are from all over the world: North America, South America, Europe, Asia, and even Antarctica are included
The Yelp dataset does not contain country data, but state is available. US and Canadian states are well known. For the other locations, I check the latitude and longitude to determine the country using the Google geocoding API
restaurants_only.state.unique()
len(restaurants_only)
# us_states = ["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DC", "DE", "FL", "GA",
# "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD",
# "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ",
# "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC",
# "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"]
# canada_states = ['ON', 'QC', 'NS', 'NB', 'MB', 'BC', 'PE', 'SK', 'AB', 'NL']
# from geopy.geocoders import GoogleV3
# # registered the following google maps geoencoding api
# apikeys = ["AIzaSyDta3bVjrSokrV2xc2MaQvN-7JFv4BkAYg", "AIzaSyCxImBiEldxwzgW10PhfXVW1A_OSZqNecI"]
# # use it by alternating randomly
# geoencoder = GoogleV3(api_key=np.random.choice(apikeys))
# # a function to find country
# def find_country(row):
# # http://en.wikipedia.org/wiki/Extreme_points_of_the_United_States#Westernmost
# top = 49.3457868 # north lat
# left = -124.7844079 # west long
# right = -66.9513812 # east long
# bottom = 24.7433195 # south lat
# # some regions elsewhere share an abbreviation with a US state. The bounding-box check below will help.
# inside_usa = bottom <= row['latitude'] <= top and left <= row['longitude'] <= right
# if(inside_usa and row.state.strip() in us_states):
# return "USA"
# elif(row.state.strip() in canada_states):
# return "Canada"
# else:
# latlon = str(row['latitude']) + ", " + str(row['longitude'])
# location = geoencoder.reverse(latlon, exactly_one=True)
# if location is not None:
# return location.address.split(",")[-1].strip()
# else:
# return "Not-Found"
# # create new feature: country
# countries_googleapi = pd.DataFrame(restaurants_only.apply(lambda row: find_country(row), axis=1))
# countries_googleapi.to_csv("countries_googleapi.csv", index=None)
# create country column
restaurants_only['country'] = pd.read_csv("countries_googleapi.csv").iloc[:, 0]
# for testing
# from geopy.geocoders import GoogleV3
# #location = geolocator.reverse("43.748826, -79.347592", exactly_one=True)
# location.address.split(",")[-1].strip()
# value counts
restaurants_only.country.value_counts()
The documentation of the Yelp dataset claims that the data is from 11 metropolitan areas in the US and Canada, but the data tells a different story: some restaurants from other countries are included
restaurants_only[~restaurants_only.country.isin(["USA", "Canada"])].sample(5)
# remove other countries
indices = restaurants_only[~restaurants_only.country.isin(["USA", "Canada"])].index
restaurants_only = restaurants_only.drop(indices).reset_index(drop=True)
# check value counts again
restaurants_only.country.value_counts()
len(restaurants_only)
def enable_plotly_in_cell():
import IPython
from plotly.offline import init_notebook_mode
display(IPython.core.display.HTML('''
<script src="/static/components/requirejs/require.js"></script>
'''))
init_notebook_mode(connected=False)
from plotly.offline import iplot
import plotly.graph_objs as go
enable_plotly_in_cell()
data = [
go.Scattermapbox(lat=restaurants_only.latitude, lon=restaurants_only.longitude,
mode='markers',
text=restaurants_only.name +", " + restaurants_only.country)
]
layout = go.Layout(
autosize=True,
hovermode='closest',
mapbox=dict(
accesstoken="pk.eyJ1IjoicHJpeWF0aGFyc2FuIiwiYSI6ImNqbGRyMGQ5YTBhcmkzcXF6YWZldnVvZXoifQ.sN7gyyHTIq1BSfHQRBZdHA",
bearing=0,
center=dict(
lat=38.92,
lon=-77.07
),
pitch=0,
zoom=1
),
width=900,
height=600,
title = "World"
)
fig = dict(data=data, layout=layout)
iplot(fig, filename='Multiple Mapbox')
Clustering by distance will successfully remove the remaining erroneous coordinates
# check any missing indices
restaurants_only.index
DBSCAN has two important hyperparameters, eps and min_samples. eps is the maximum distance between two data points for them to be considered neighbors. As eps increases, the number of clusters decreases because some clusters merge. min_samples is the minimum number of data points required to form a dense region. If it is small, there will be a lot of clusters.
There are three kinds of points: core, boundary, and noise. Core points have at least min_samples points in their neighborhood. Boundary points do not have enough neighbors to be core points themselves, but are directly reachable from at least one core point. Other points can be reached indirectly through chains of core points.
Reachability is not a symmetric relation since, by definition, no point may be reachable from a non-core point, regardless of distance (so a non-core point may be reachable, but nothing can be reached from it). Therefore, a further notion of connectedness is needed to formally define the extent of the clusters found by DBSCAN. Two points p and q are density-connected if there is a point o such that both p and q are reachable from o. Density-connectedness is symmetric.
A cluster then satisfies two properties: all points within the cluster are mutually density-connected, and any point that is density-reachable from some point of the cluster is part of the cluster as well.
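The core/boundary/noise distinction can be read directly off a fitted scikit-learn DBSCAN; a tiny synthetic 2-D example (not the restaurant coordinates):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# synthetic data: two tight blobs, one boundary point, one isolated point
pts = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.1, 0.1], [0.05, 0.05],
                [0.3, 0.0],                      # near the first blob, too few neighbors to be core
                [5.0, 5.0], [5.0, 5.1], [5.1, 5.0], [5.1, 5.1], [5.05, 5.05],
                [20.0, 20.0]])                   # far from everything: noise
db = DBSCAN(eps=0.2, min_samples=4).fit(pts)
core = set(db.core_sample_indices_)
# a point is core if listed in core_sample_indices_, noise if its label is -1,
# and boundary otherwise (labeled, but not core)
kinds = ["core" if i in core else ("noise" if lab == -1 else "boundary")
         for i, lab in enumerate(db.labels_)]
```

Note that sklearn counts the point itself toward min_samples, which is why each 5-point blob is entirely core here.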
HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) performs DBSCAN over varying epsilon values and integrates the results to find the clustering that is most stable over epsilon. This allows HDBSCAN to find clusters of varying densities (unlike DBSCAN) and makes it more robust to parameter selection.
HDBSCAN has two important hyperparameters: min_samples and min_cluster_size. It does not need a maximum distance; instead it takes min_cluster_size, the smallest number of points a cluster must contain.
coords = restaurants_only[['latitude', 'longitude']]
from sklearn.cluster import DBSCAN
# doing dbscan
kms_per_radian = 6371.0088
miles_per_radian = 3959
# convert one mile to radians
epsilon = 1 / miles_per_radian
# haversine distance requires coordinates to be in radians
db_clusterer = DBSCAN(eps=epsilon, min_samples=5, algorithm='ball_tree', metric='haversine').fit(np.radians(coords))
db_cluster_labels = db_clusterer.labels_
# exclude the noise label (-1) so the count and the range below match the real cluster labels
db_num_clusters = len(set(db_cluster_labels)) - (1 if -1 in db_cluster_labels else 0)
db_clusters = pd.Series([coords[db_cluster_labels == n] for n in range(db_num_clusters)])
print('Number of clusters: {}'.format(db_num_clusters))
HDBSCAN doesn't take a maximum distance; it requires min_cluster_size and min_samples. It can also be used with the haversine metric. Its biggest advantage is that it works when the data has varying densities.
import hdbscan
with warnings.catch_warnings():
warnings.simplefilter("ignore")
# 30: 80/20 split 6, 24
hdb_clusterer = hdbscan.HDBSCAN(algorithm='best', metric='haversine', min_cluster_size=30, min_samples=5)
hdb_clusterer.fit(np.radians(coords))
hdb_cluster_labels = hdb_clusterer.labels_
# exclude the noise label (-1) so the count and the range below match the real cluster labels
hdb_num_clusters = len(set(hdb_cluster_labels)) - (1 if -1 in hdb_cluster_labels else 0)
hdb_clusters = pd.Series([coords[hdb_cluster_labels == n] for n in range(hdb_num_clusters)])
print('Number of clusters: {}'.format(hdb_num_clusters))
print(len(db_clusterer.core_sample_indices_))
print(len(restaurants_only) - len(db_clusterer.core_sample_indices_))
Only 1227 restaurants have fewer than 5 neighbors within one mile.
# core indices
restaurants_only.loc[db_clusterer.core_sample_indices_].sample(5)
# all are given labels
assert(len(restaurants_only) == len(db_cluster_labels))
# add cluster labels to the dataframe
restaurants_only['cluster_labels_db'] = db_clusterer.labels_
restaurants_only['cluster_labels_hdb'] = hdb_clusterer.labels_
# remove -1
restaurants_only_db = restaurants_only[restaurants_only.cluster_labels_db != -1].reset_index(drop=True)
restaurants_only_hdb = restaurants_only[restaurants_only.cluster_labels_hdb != -1].reset_index(drop=True)
restaurants_only_db.index
restaurants_only_hdb.index
HDBSCAN detected more outliers
# draw the map
def enable_plotly_in_cell():
import IPython
from plotly.offline import init_notebook_mode
display(IPython.core.display.HTML('''
<script src="/static/components/requirejs/require.js"></script>
'''))
init_notebook_mode(connected=False)
from plotly.offline import iplot
import plotly.graph_objs as go
enable_plotly_in_cell()
layout = go.Layout(
autosize=True,
hovermode='closest',
mapbox=dict(
accesstoken="pk.eyJ1IjoicHJpeWF0aGFyc2FuIiwiYSI6ImNqbGRyMGQ5YTBhcmkzcXF6YWZldnVvZXoifQ.sN7gyyHTIq1BSfHQRBZdHA",
bearing=0,
center=dict(
lat=38.92,
lon=-77.07
),
pitch=0,
zoom=1
),
width=900,
height=600,
title = "World"
)
data1 = [
go.Scattermapbox(lat=restaurants_only_db.latitude, lon=restaurants_only_db.longitude,
mode='markers',
text=restaurants_only_db.name +", " + restaurants_only_db.country + ", " + restaurants_only_db.state)
]
data2 = [
go.Scattermapbox(lat=restaurants_only_hdb.latitude, lon=restaurants_only_hdb.longitude,
mode='markers',
text=restaurants_only_hdb.name + ", " + restaurants_only_hdb.country + ", " + restaurants_only_hdb.state)
]
layout.title = "DBSCAN"
fig1 = dict(data=data1, layout=layout)
iplot(fig1, filename='Multiple Mapbox1')
layout.title = "HDBSCAN"
fig2 = dict(data=data2, layout=layout)
iplot(fig2, filename='Multiple Mapbox2')
# testing delete
# rr = np.array([[1,2,3], [4,5,7]])
# np.delete(rr, (0), axis=0)
# showing num_of_samples in cluster
from collections import Counter
# sort labels by num of samples in ascending order
sorted_labels = np.array(sorted(Counter(db_cluster_labels).items(), key=lambda kv: kv[1]))
neg1loc = np.where(sorted_labels[:,0] == -1)[0][0]
# remove -1 count
sorted_labels = np.delete(sorted_labels, (neg1loc), axis=0)
sns.boxplot(sorted_labels[:,1])
plt.title("Showing outliers: All 296 clusters")
plt.show()
# 241 clusters are non-outliers out of 296 clusters
sns.boxplot(sorted_labels[:241,1])
plt.title("Smallest 241 Clusters")
plt.show()
sns.distplot(sorted_labels[:241,1])
plt.title("Distribution-Bottom 241")
plt.show()
# least num of samples
sorted_labels[:10, 1]
# highest num of samples
sorted_labels[-10:, 1][::-1]
sns.barplot(np.arange(len(restaurants_only_db.state.unique())), restaurants_only_db.state.value_counts().values)
restaurants_only_db.state.unique()
len(set(hdb_clusterer.labels_))
# showing num_of_samples in cluster
from collections import Counter
# sort labels by num of samples in ascending order
sorted_labels = np.array(sorted(Counter(hdb_cluster_labels).items(), key=lambda kv: kv[1]))
neg1loc = np.where(sorted_labels[:,0] == -1)[0][0]
# remove -1 count
sorted_labels = np.delete(sorted_labels, (neg1loc), axis=0)
sns.boxplot(sorted_labels[:,1])
plt.title("Showing outliers: All 654 clusters")
plt.show()
There are no extreme outliers. The maximum cluster size is only about 400 samples.
# 606 out of 654 clusters are non-outliers
sns.boxplot(sorted_labels[:606,1])
plt.title("Smallest 606 Clusters")
plt.show()
sns.distplot(sorted_labels[:606,1])
plt.title("Distribution-Bottom 606")
plt.show()
Unlike DBSCAN, HDBSCAN's clusters all have more than 80 samples.
# least num of samples
sorted_labels[:10, 1]
# highest num of samples
sorted_labels[-10:, 1][::-1]
# get center most point from a set of coordinates
from shapely.geometry import MultiPoint
def get_centermost_point(coordinates):
centroid = (MultiPoint(coordinates).centroid.x, MultiPoint(coordinates).centroid.y)
centermost_point = min(coordinates, key=lambda point: great_circle(point, centroid).m)
return tuple(centermost_point)
# A function to show clusters by state DBSCAN
def plot_cluser_by_state(state_code):
state = restaurants_only_db[restaurants_only_db.state == state_code]
# get centermost point
centermost_point = get_centermost_point(state[['latitude', 'longitude']].values)
# plotting the clusters in a map
import random
plotly_data = []
for label in set(state.cluster_labels_db):
r = int(random.random() * 256)
g = int(random.random() * 256)
b = int(random.random() * 256)
the_cluster = state[state.cluster_labels_db == label ]
mapbox = go.Scattermapbox(
lat= the_cluster.latitude ,
lon= the_cluster.longitude,
mode='markers',
fillcolor = 'rgba(' + str(r) + ',' + str(g) + ',' + str(b) + ',' + str(0.5) + ')',
marker=dict(
size= 8
#color = 'rgba(' + str(r) + ',' + str(g) + ',' + str(b) + ',' + str(0.5) + ')',
),
name ='C ' + str(label) + "("+ str(len(the_cluster)) +")",
text=the_cluster.name
)
plotly_data.append(mapbox)
data = plotly_data
layout = go.Layout(autosize=False, font=dict(color='black'),
mapbox= dict(accesstoken="pk.eyJ1IjoidGdhamF2YSIsImEiOiJjanNqeDJuZHUxZGsxNDlxZ25mYWlucXlkIn0.0kR1orNq4Gv32CTmQKKTJw",
style='dark',
bearing=3,
pitch=5,
zoom=10,
center= dict(
lat= centermost_point[0],
lon= centermost_point[1])
),
width=900,
height=600, title = "One Mile Clusters - " + state_code
)
fig = dict(data=data, layout=layout)
iplot(fig)
all_states = restaurants_only_db.state.unique()
for state in all_states[3:4]:
plot_cluser_by_state(state)
# look at cluster 2 assignments.
restaurants_only_db[restaurants_only_db.cluster_labels_db == 2].state.value_counts()
restaurants_only_db[(restaurants_only_db.state == "AB") & (restaurants_only_db.cluster_labels_db == 2)]
restaurants_only_db[(restaurants_only_db.state == "BC") & (restaurants_only_db.cluster_labels_db == 2)]
b[b.business_id=="Z6ho09TTe3Y9eXwX2n5vJA"].address
b[b.business_id=="72GGspuzlRC4hhIgy-IFAw"].address
These addresses are in Ontario. The state labels are a mistake in the data.
first address
second address
# Domino's Pizza is a core point
37260 in db_clusterer.core_sample_indices_
# A vectorized implementation of the geopy.distance.great_circle function
# Adapted by Tinsae
# reference
# https://gist.github.com/rochacbruno/2883505
import math
# calculate the distance between a business and all other businesses
def distance_c(many_points, one_point):
# convert the values to float
lat1 = many_points[:,0]
lon1 = many_points[:,1]
lat2 = one_point[0]
lon2 = one_point[1]
radius = 6371 # km
#radius = 3959 # miles
dlat = np.radians(lat2 - lat1)
dlon = np.radians(lon2 - lon1)
a = np.sin(dlat/2) * np.sin(dlat/2) + np.cos(np.radians(lat1)) \
* np.cos(np.radians(lat2)) * np.sin(dlon/2) * np.sin(dlon/2)
c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1-a))
d = radius * c
return d
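As a sanity check, the vectorized formula is restated stand-alone below and compared against a known distance: one degree of longitude at the equator is about 111.19 km.

```python
import numpy as np

def distance_c(many_points, one_point):
    # same vectorized haversine as above, radius in km
    lat1, lon1 = many_points[:, 0], many_points[:, 1]
    lat2, lon2 = one_point
    radius = 6371  # km
    dlat = np.radians(lat2 - lat1)
    dlon = np.radians(lon2 - lon1)
    a = np.sin(dlat / 2) ** 2 + np.cos(np.radians(lat1)) \
        * np.cos(np.radians(lat2)) * np.sin(dlon / 2) ** 2
    return radius * 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

pts = np.array([[0.0, 0.0], [0.0, 1.0]])
d = distance_c(pts, (0.0, 0.0))
# d[0] is 0, d[1] is roughly 111.19 km
```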
# from geopy.distance import great_circle
# sushi8 = (43.748826, -79.347592)
# for i, j in restaurants_only.loc[db.core_sample_indices_][['latitude', 'longitude']].iterrows():
# print(great_circle((j.latitude, j.longitude), sushi8))
# find the nearest point to dominos in cluster 2
core_samples = restaurants_only_db.loc[db_clusterer.core_sample_indices_][['latitude', 'longitude']].values
dominos = [43.748826, -79.347592]
sorted_by_closeness = np.argsort(distance_c(core_samples, dominos))
# find that restaurant
print(core_samples[sorted_by_closeness[0]])
# the closest except itself: use 1 instead of 0
the_closest_core = sorted_by_closeness[1]
restaurants_only_db.loc[db_clusterer.core_sample_indices_[the_closest_core]]
# here are the top 10 closest to dominos in cluster 2
from geopy.distance import great_circle
print("{:35s}{:10s}{:10s}".format("Restaurant", "State","Distance(Miles)"))
print()
for i in range(0,11):
current = sorted_by_closeness[i]
the_indices = db_clusterer.core_sample_indices_[current]
resta = restaurants_only_db.loc[the_indices]
latlon = (resta.latitude, resta.longitude)
print("{:35s}{:10s}{:5.5f}".format(resta['name'], resta.state, great_circle(dominos, latlon).miles))
subway = [43.875795, -79.413535]
sorted_by_closeness = np.argsort(distance_c(core_samples, subway))
# here are the top 10 closest to subway in cluster2
from geopy.distance import great_circle
print("{:35s}{:10s}{:10s}".format("Restaurant", "State","Distance(Miles)"))
print()
for i in range(0,11):
current = sorted_by_closeness[i]
the_indices = db_clusterer.core_sample_indices_[current]
resta = restaurants_only_db.loc[the_indices]
latlon = (resta.latitude, resta.longitude)
print("{:35s}{:10s}{:5.5f}".format(resta['name'], resta.state, great_circle(subway, latlon).miles))
Both Subway and Domino's are in Ontario, as I discovered in a previous cell.
# A function to show clusters by state HDBSCAN
def plot_cluser_by_state(state_code):
state = restaurants_only_hdb[restaurants_only_hdb.state == state_code]
# get centermost point
centermost_point = get_centermost_point(state[['latitude', 'longitude']].values)
# plotting the clusters in a map
import random
plotly_data = []
for label in set(state.cluster_labels_hdb):
r = int(random.random() * 256)
g = int(random.random() * 256)
b = int(random.random() * 256)
the_cluster = state[state.cluster_labels_hdb == label ]
mapbox = go.Scattermapbox(
lat= the_cluster.latitude ,
lon= the_cluster.longitude,
mode='markers',
fillcolor = 'rgba(' + str(r) + ',' + str(g) + ',' + str(b) + ',' + str(0.5) + ')',
marker=dict(
size= 8
#color = 'rgba(' + str(r) + ',' + str(g) + ',' + str(b) + ',' + str(0.5) + ')',
),
name ='C ' + str(label) + "("+ str(len(the_cluster)) +")",
text=the_cluster.name
)
plotly_data.append(mapbox)
data = plotly_data
layout = go.Layout(autosize=False, font=dict(color='black'),
mapbox= dict(accesstoken="pk.eyJ1IjoidGdhamF2YSIsImEiOiJjanNqeDJuZHUxZGsxNDlxZ25mYWlucXlkIn0.0kR1orNq4Gv32CTmQKKTJw",
style='dark',
bearing=3,
pitch=5,
zoom=10,
center= dict(
lat= centermost_point[0],
lon= centermost_point[1])
),
width=900,
height=600, title = "One Mile Clusters - " + state_code
)
fig = dict(data=data, layout=layout)
iplot(fig)
all_states = restaurants_only_hdb.state.unique()
for state in all_states[3:6]:
plot_cluser_by_state(state)
sns.set(rc={'figure.figsize':(8,6)})
sns.countplot(restaurants_only_hdb.is_open)
plt.title("Imbalanced Classes")
plt.xticks([0,1], ("Closed", "Open"))
plt.xlabel("Class")
plt.ylabel("Frequency")
sns.despine()
plt.show()
restaurants_only_hdb.is_open.value_counts()
This project uses geospatial features to capture the performance of surrounding restaurants. In simple random undersampling, every business has an equal chance of being selected, which produces a sample where many restaurants have no near neighbour. Simple random undersampling doesn't consider similarities among data points.
Instead, I try to maximize the availability of surrounding restaurants within one mile of every sampled restaurant. Here are the steps to follow:
rus = RandomUnderSampler(sampling_strategy=the_dict)
newX, newY = rus.fit_sample(X, y)
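The same per-cluster undersampling can be sketched with plain NumPy (hypothetical cluster sizes; the `desired` dict plays the role of `the_dict` above):

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical cluster labels: 30 points in cluster 0, 10 in cluster 1
labels = np.array([0] * 30 + [1] * 10)
# desired per-cluster sample counts (stands in for the_dict)
desired = {0: 15, 1: 5}
# draw without replacement from each cluster separately
keep = np.concatenate([
    rng.choice(np.where(labels == c)[0], size=n, replace=False)
    for c, n in desired.items()
])
# keep now indexes a sample that preserves the cluster proportions
```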
The cluster label is part of the dataframe, so I do the train/test split stratified by the cluster label.
The first 3 steps are done in the previous cells
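A minimal, self-contained sketch of a cluster-stratified split (toy ids and labels, not the real dataframe):

```python
import numpy as np
from sklearn.model_selection import train_test_split

ids = np.arange(50).reshape(-1, 1)           # toy "businesses"
clusters = np.repeat([0, 1, 2, 3, 4], 10)    # hypothetical cluster labels
train, test = train_test_split(ids, test_size=0.2, random_state=20,
                               shuffle=True, stratify=clusters)
# every cluster contributes 20% of its members to the test set
```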
restaurants_only_hdb.head()
restaurants_only_hdb_open = restaurants_only_hdb[restaurants_only_hdb.is_open == 1]
from imblearn.under_sampling import RandomUnderSampler
cluster_labels_unique = sorted(restaurants_only_hdb_open.cluster_labels_hdb.unique())
cluster_counts = restaurants_only_hdb_open.cluster_labels_hdb.value_counts().sort_index().values
num_of_open_restaurants = restaurants_only_hdb_open.shape[0]
# undersample the majority class to match the minority class
undersampled_to = np.min(restaurants_only_hdb.is_open.value_counts().values)
desired_number = {label:int((count/num_of_open_restaurants) * undersampled_to) for label, count in zip(cluster_labels_unique, cluster_counts)}
sns.distplot(list(desired_number.values()))
plt.title("Strata Size Distribution")
plt.ylabel("Probability")
plt.xlabel("Strata Size")
sns.despine()
# rus = RandomUnderSampler(sampling_strategy=desired_number)
# X = restaurants_only_hdb_open.copy()
# y = restaurants_only_hdb_open.cluster_labels_hdb
# new_X, new_y = rus.fit_sample(X, y)
# open_busin_under = pd.DataFrame(new_X, columns=restaurants_only_hdb_open.columns)
# open_busin_under.head()
# restaurants_only_hdb_closed = restaurants_only_hdb[restaurants_only_hdb.is_open == 0]
# restaurants_undersampled = pd.concat([open_busin_under, restaurants_only_hdb_closed], ignore_index=True).reset_index(drop=True)
# assert(restaurants_undersampled.shape[0] == restaurants_only_hdb_closed.shape[0] + open_busin_under.shape[0])
# # save data for future runs
# restaurants_undersampled.to_csv("yelp-dataset/restaurants_undersampled.csv")
# load saved data
restaurants_undersampled = pd.read_csv("yelp-dataset/restaurants_undersampled.csv", index_col=0)
restaurants_undersampled.shape
restaurants_undersampled.head()
restaurants_undersampled.state.value_counts()
restaurants_undersampled.review_count.describe()
import spacy
nlp = spacy.load("en", disable=['parser', 'ner'])
doc = nlp("She was dancing and walking in the garden. Flowers and plants are refreshing. I have a flower factory")
for token in doc:
print(token.lemma_)
# #using spacy lemmatizer as a class
# from spacy.lemmatizer import Lemmatizer
# from spacy.lang.en import LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES
# lemmatizer = Lemmatizer(LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES)
# print(nltk.pos_tag(["traumatizing"]))
# print(lemmatizer("hers", "P"))
# nltk classes used for tokenization and lemmatization
# from nltk import word_tokenize
# from nltk.stem import WordNetLemmatizer
Scikit-learn's vectorizers don't convert words into their base form by default. A custom lemmatizer/tokenizer can be passed as the tokenizer argument of CountVectorizer and TfidfVectorizer objects.
I tried NLTK's word_tokenize and WordNetLemmatizer. NLTK's lemmatizer works on a specified part of speech (the default is noun), so it is not flexible enough to handle multiple parts of speech at once.
spaCy is much more flexible: when a spaCy document is created, it automatically tags the part of speech of every token and converts each word into its base form according to its POS tag. The only problem with spaCy is speed. Creating a spaCy document involves tagging, parsing, and named entity recognition; disabling the steps I don't need made it considerably faster.
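Passing a custom tokenizer works the same way for both vectorizers; here a toy lowercase-split tokenizer stands in for the spaCy lemmatizer so the sketch runs without a spaCy model:

```python
from sklearn.feature_extraction.text import CountVectorizer

def toy_tokenize(text):
    # stand-in for lemmatizer_tokenize: plain lowercase whitespace split
    return text.lower().split()

vect = CountVectorizer(tokenizer=toy_tokenize)
docs = ["Flowers are refreshing", "a flower factory"]
bow = vect.fit_transform(docs)
# the vocabulary keeps "flowers" and "flower" as distinct tokens --
# exactly the pair a lemmatizing tokenizer would merge
```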
from spacy import displacy
displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})
import spacy
nlp = spacy.load("en", disable=['parser', 'ner'])
# Use custom tokenizer and lemmatizer
def lemmatizer_tokenize(a_review):
# remove extra space and new line
a_review = ' '.join(a_review.split())
a_review = nlp(a_review)
words = [token.lemma_ for token in a_review
if (
not token.is_digit and
not token.is_punct and
not token.is_stop and
token.lemma_ != "-PRON-"
)
]
return words
# testing lemmatizer
lemmatizer_tokenize("Dancing entertains people. The girl was dancing when the shooters entered the club")
The first "Dancing" is a noun; the second "dancing" is a verb. spaCy only converted the verb to its base form. "Shooters" is a noun, converted to "shooter"; "entered" is a verb, converted to "enter".
# #tokenizer pattern and replace practice
# myvect1 = CountVectorizer(stop_words = 'english')
# myvect2 = CountVectorizer(stop_words = 'english')
# somelist = pd.Series(["Tinse is great ጥሩ ˈdäNG person", "Ethiopia is my country!", "15Brown fox", "Quick", "13 it is nice"])
# somelist2 = pd.Series(["Tinse is great ጥሩ", "däNG person", "fox"])
# somelist = somelist.str.replace(r'(?u)[^\w\s]+|[\d]+', '')
# somelist2 = somelist2.str.replace(r'(?u)[^\w\s]+|[\d]+', '')
# #print(somelist)
# transformed1 = myvect1.fit(somelist)
# transformed2 = myvect2.fit(somelist2)
# #print(myvect.vocabulary_)
# myvect1.get_feature_names()
# myvect2.get_feature_names()
# the_vocabulary = set(sorted(myvect1.get_feature_names() + myvect2.get_feature_names()))
# myvectx = TfidfVectorizer(vocabulary=the_vocabulary)
# myvectx.vocabulary
# sparse.csr_matrix([1,2,3,0])
Inverse document frequency is calculated over a whole corpus, so doing the train/test split after tf-idf vectorization would leak information from the test set into the training features. I therefore vectorized the train and test sets separately: the vectorizer is fitted on the train set, and its vocabulary and IDF weights are reused on the test set.
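The leak-free pattern, sketched with toy documents: fit on the training texts only, then reuse the fitted vocabulary and IDF weights on the test texts.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["great food great service", "slow service"]
test_docs = ["great pizza"]   # "pizza" is unseen, so it is ignored

tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(train_docs)  # IDF computed from train only
X_test = tfidf.transform(test_docs)        # same vocabulary, no refit
```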
Do train/test split stratified by cluster labels
# X = restaurants_undersampled.copy()
# y = restaurants_undersampled.cluster_labels_hdb
# # stratified sampling
# train, test, _, _ = train_test_split(X, y, test_size=0.2, random_state=20, shuffle=True, stratify=y)
# train = train.reset_index(drop=True)
# test = test.reset_index(drop=True)
# # sort them
# train_tfidf = train.sort_values(by="business_id").reset_index(drop=True)
# test_tfidf = test.sort_values(by="business_id").reset_index(drop=True)
# print(len(train_tfidf))
# print(len(test_tfidf))
# train_tfidf.to_csv("yelp-dataset/train_tfidf")
# test_tfidf.to_csv("yelp-dataset/test_tfidf")
train_tfidf = pd.read_csv("yelp-dataset/train_tfidf", index_col=0)
test_tfidf = pd.read_csv("yelp-dataset/test_tfidf", index_col=0)
train_tfidf.head(5)
test_tfidf.head(5)
# need a lot of memory
#reviews = pd.read_json("yelp_academic_dataset_review.json", lines=True)[["business_id", "text"]]
#reviews.head()
# #Takes 2 hours. Saved models are loaded in the next cells
# import time
# from sklearn.feature_extraction.text import TfidfVectorizer
# start_time = time.time()
# from scipy import sparse
# tfidf_train_reviews_full = pd.merge(train_tfidf[["business_id"]], reviews[["business_id", "text"]], on=["business_id"])
# # save dense matrix to file
# tfidf_train_reviews_full.to_csv("yelp-dataset/tfidf_train_reviews_full.csv")
# # tokens containing numbers, digits and underscore are ignored
# tfidf_vectorizer_train_full = TfidfVectorizer(stop_words='english', min_df=0.001, tokenizer=lemmatizer_tokenize)
# # remove punctuation and numbers
# tfidf_train_reviews_full.text = tfidf_train_reviews_full.text.str.replace(r'[^\w\s]+|[\d]+', '')
# tfidf_features_train_full = tfidf_vectorizer_train_full.fit_transform(tfidf_train_reviews_full.text)
# # save vectorizer
# pickle.dump(tfidf_vectorizer_train_full, open("yelp-dataset/tfidf_vectorizer_train_full.sav", 'wb'))
# # save sparse matrix to file
# sparse.save_npz("yelp-dataset/tfidf_features_train_full.npz", tfidf_features_train_full)
# print("shape of dense matrix: ", tfidf_train_reviews_full.shape)
# print("shape of sparse matrix:(tfidf) ", tfidf_features_train_full.shape)
# print(time.time() - start_time)
tfidf_features_train_full = sparse.load_npz("yelp-dataset/tfidf_features_train_full.npz")
tfidf_vectorizer_train_full = pickle.load(open("yelp-dataset/tfidf_vectorizer_train_full.sav", 'rb'))
tfidf_train_reviews_full = pd.read_csv("yelp-dataset/tfidf_train_reviews_full.csv", index_col=0)
# get the indices of reviews for each business id
reviews_indices_dict_tfidf_train_full = tfidf_train_reviews_full.business_id.groupby(tfidf_train_reviews_full.business_id).groups
tfidf_features_train_full.shape
# one way to ensure accuracy of data
assert(list(np.arange(0, 939226, 1)) == tfidf_train_reviews_full.index.tolist())
# 3229 words
len(tfidf_vectorizer_train_full.get_feature_names())
import time
start_time = time.time()
num_of_sparse_features = len(tfidf_vectorizer_train_full.vocabulary_)
# k goes over the business ids
# i counts
the_sparse_matrices_tf_idf_train_full = []
for i, k in enumerate(reviews_indices_dict_tfidf_train_full.keys()):
if( i % 5000 == 0):
print("creating row: ", i)
# convert to list: row indices for the reviews of k
indices_for_sparse_matrix = list(reviews_indices_dict_tfidf_train_full[k])
# find means column-wise
means_sparse = tfidf_features_train_full[indices_for_sparse_matrix, :].mean(axis=0).tolist()[0]
the_sparse_matrices_tf_idf_train_full.append(means_sparse)
pickle.dump(the_sparse_matrices_tf_idf_train_full, open("yelp-dataset/the_sparse_matrices_tf_idf_train_full.sav", "wb"))
print(time.time() - start_time)
# one way to check accuracy of data
print(all(tfidf_train_reviews_full.business_id.iloc[i] <= tfidf_train_reviews_full.business_id.iloc[i+1] for i in range(len(tfidf_train_reviews_full.business_id)-1)))
print(all(train_tfidf.business_id.iloc[i] <= train_tfidf.business_id.iloc[i+1] for i in range(len(train_tfidf.business_id)-1)))
# one way to check data consistency
assert(len(sorted(reviews_indices_dict_tfidf_train_full.keys())) == len(sorted(train_tfidf.business_id)))
# business ids in the dict may not be in the same order as in train because the split was stratified and shuffled; compare as sets
set(reviews_indices_dict_tfidf_train_full.keys()) == set(train_tfidf.business_id)
tf_idf_aggregate_train = sparse.csr_matrix(the_sparse_matrices_tf_idf_train_full)
# merge sparse and dense features
dense_f = train_tfidf[['is_open', 'cluster_labels_hdb', 'stars', 'review_count', 'latitude', 'longitude', 'age']].astype(float)
tf_idf_train_final = hstack((dense_f, tf_idf_aggregate_train)).tocsr()
tf_idf_train_final
tf_idf_aggregate_train.shape
X_train_tf_idf = tf_idf_train_final[:,1:]
# convert is_open to is_closed (1 when is_open == 0)
y_train_tf_idf = np.where(tf_idf_train_final[:,0].toarray() == 0, 1, 0)
# #Takes 1 hour. Saved models are loaded in the next cells
# import time
# from sklearn.feature_extraction.text import TfidfVectorizer
# start_time = time.time()
# from scipy import sparse
# tfidf_test_reviews_full = pd.merge(test_tfidf, reviews, on=["business_id"])
# # save dense matrix to file
# tfidf_test_reviews_full.to_csv("yelp-dataset/tfidf_test_reviews_full.csv")
# # tokens containing numbers, digits and underscore are ignored
# # remove punctuation and numbers
# tfidf_test_reviews_full.text = tfidf_test_reviews_full.text.str.replace(r'[^\w\s]+|[\d]+', '')
# # use train vectorizer
# tfidf_features_test_full = tfidf_vectorizer_train_full.transform(tfidf_test_reviews_full.text)
# # save sparse matrix to file
# sparse.save_npz("yelp-dataset/tfidf_features_test_full.npz", tfidf_features_test_full)
# print("shape of dense matrix: ", tfidf_test_reviews_full.shape)
# print("shape of sparse matrix:(tfidf) ", tfidf_features_test_full.shape)
# print(time.time() - start_time)
tfidf_features_test_full = sparse.load_npz("yelp-dataset/tfidf_features_test_full.npz")
tfidf_test_reviews_full = pd.read_csv("yelp-dataset/tfidf_test_reviews_full.csv", index_col=0)
# get the indices of reviews for each business id
reviews_indices_dict_tfidf_test_full = tfidf_test_reviews_full.business_id.groupby(tfidf_test_reviews_full.business_id).groups
import time
start_time = time.time()
num_of_sparse_features = len(tfidf_vectorizer_train_full.vocabulary_)
# k goes over the business ids
# i counts
the_sparse_matrices_tf_idf_test_full = []
for i, k in enumerate(reviews_indices_dict_tfidf_test_full.keys()):
if( i % 1000 == 0):
print("creating row: ", i)
# convert to list: row indices for the reviews of k
indices_for_sparse_matrix = list(reviews_indices_dict_tfidf_test_full[k])
# find means column-wise
means_sparse = tfidf_features_test_full[indices_for_sparse_matrix, :].mean(axis=0).tolist()[0]
the_sparse_matrices_tf_idf_test_full.append(means_sparse)
pickle.dump(the_sparse_matrices_tf_idf_test_full, open("yelp-dataset/the_sparse_matrices_tf_idf_test_full.sav", "wb"))
print(time.time() - start_time)
# test consistency
assert(len(sorted(reviews_indices_dict_tfidf_test_full.keys())) == len(sorted(test_tfidf.business_id)))
# business ids in the dict may not be in the same order as in test because the split was stratified and shuffled; compare as sets
set(reviews_indices_dict_tfidf_test_full.keys()) == set(test_tfidf.business_id)
tf_idf_aggregate_test = sparse.csr_matrix(the_sparse_matrices_tf_idf_test_full)
# merge sparse and dense features
dense_f = test_tfidf[['is_open', 'cluster_labels_hdb', 'stars', 'review_count', 'latitude', 'longitude', 'age']].astype(float)
tf_idf_test_final = hstack((dense_f, tf_idf_aggregate_test)).tocsr()
tf_idf_test_final
X_test_tf_idf = tf_idf_test_final[:,1:]
# convert is_open to is_closed (1 when is_open == 0)
y_test_tf_idf = np.where(tf_idf_test_final[:,0].toarray() == 0, 1, 0)
assert(len(test_tfidf) == len(reviews_indices_dict_tfidf_test_full.keys()))
tf_idf_aggregate_test.shape
test_tfidf.shape
X_test_tf_idf.shape
# #Takes 2 hours. Saved models are loaded in the next cells
# import time
# from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
# start_time = time.time()
# from scipy import sparse
# restaurants_reviews_full = pd.merge(restaurants_undersampled, reviews, on=["business_id"])
# # save dense matrix to file
# restaurants_reviews_full.to_csv("yelp-dataset/restaurants_reviews_full.csv")
# # tokens containing numbers, digits and underscore are ignored
# restaurants_vectorizer_bow_full = CountVectorizer(stop_words='english', min_df=0.001, tokenizer=lemmatizer_tokenize)
# # remove punctuation and numbers
# restaurants_reviews_full.text = restaurants_reviews_full.text.str.replace(r'[^\w\s]+|[\d]+', '')
# bow_features_full = restaurants_vectorizer_bow_full.fit_transform(restaurants_reviews_full.text)
# # save vectorizer, the vectorizer contains the vocabulary
# pickle.dump(restaurants_vectorizer_bow_full, open("yelp-dataset/restaurants_vectorizer_bow_full.sav", 'wb'))
# # save sparse matrix to file
# sparse.save_npz("yelp-dataset/bow_features_full.npz", bow_features_full)
# print("shape of dense matrix: ", restaurants_reviews_full.shape)
# print("shape of sparse matrix:(tfidf) ", bow_features_full.shape)
# print(time.time() - start_time)
bow_features_full = sparse.load_npz("yelp-dataset/bow_features_full.npz")
restaurants_vectorizer_bow_full = pickle.load(open("yelp-dataset/restaurants_vectorizer_bow_full.sav", 'rb'))
restaurants_reviews_full = pd.read_csv("yelp-dataset/restaurants_reviews_full.csv", index_col=0)
# get the indices of reviews for each business id
reviews_indices_dict_bow_full = restaurants_reviews_full.business_id.groupby(restaurants_reviews_full.business_id).groups
print(bow_features_full.shape)
print(restaurants_reviews_full.shape)
import time
#takes 10 minutes
start_time = time.time()
num_of_sparse_features = len(restaurants_vectorizer_bow_full.vocabulary_)
# k goes over the business ids
# i counts
the_sparse_matrices_bag_full = []
# closed restaurants
for i, k in enumerate(reviews_indices_dict_bow_full.keys()):
if( i % 10000 == 0):
print("creating row: ", i)
# convert to list: row indices for the reviews of k
indices_for_sparse_matrix = list(reviews_indices_dict_bow_full[k])
# find means column-wise
means_sparse = bow_features_full[indices_for_sparse_matrix, :].mean(axis=0).tolist()[0]
the_sparse_matrices_bag_full.append(means_sparse)
# save aggregate sparse matrix
pickle.dump(the_sparse_matrices_bag_full, open("yelp-dataset/the_sparse_matrices_bag_full.sav", "wb"))
print(time.time() - start_time)
restaurants_bow_aggregate = sparse.csr_matrix(the_sparse_matrices_bag_full)
restaurants_bow_aggregate
restaurants_undersampled.shape
# merge sparse and dense features--include cluster labels
dense_f = restaurants_undersampled[['is_open', 'cluster_labels_hdb', 'stars', 'review_count', 'latitude', 'longitude', 'age']].astype(float)
restaurants_bow_final = hstack((dense_f, restaurants_bow_aggregate)).tocsr()
restaurants_bow_final
X = restaurants_bow_final
# use cluster label to stratify train_test_split
y = restaurants_bow_final[:, 1].toarray()
# stratified sampling
train_bow, test_bow, _, _ = train_test_split(X, y, test_size=0.2, random_state=20, shuffle=True, stratify=y)
train_bow.shape
test_bow.shape
# hdbclusterlabel, dense features, sparse features
# the cluster label will be used for cross validation and icf penalty
X_train_bow = train_bow[:, 1:]
# is_closed?
y_train_bow = np.where(train_bow[:, 0].toarray() == 0, 1, 0)
X_test_bow = test_bow[:, 1:]
y_test_bow = np.where(test_bow[:, 0].toarray() ==0, 1, 0)
X_train_bow.shape
X_test_bow.shape
X_train_bow[:,0].toarray()
count_classes = pd.value_counts(y_train_bow.flatten(), sort=True).sort_index()
count_classes.plot(kind = 'bar')
plt.title("Balanced Classes")
plt.xlabel("Class")
plt.ylabel("Frequency")
plt.show()
print(count_classes)
count_classes = pd.value_counts(y_test_bow.flatten(), sort=True).sort_index()
count_classes.plot(kind = 'bar')
plt.title("Balanced Classes")
plt.xlabel("Class")
plt.ylabel("Frequency")
plt.show()
print(count_classes)
np.where(X_train_bow[:, 0].toarray().flatten() == 3.0)[0]
from sklearn.base import BaseEstimator, TransformerMixin
class ICFPenalty(BaseEstimator, TransformerMixin):
def transform(self, X, *_):
#ipdb.set_trace()
self.X = X
total_number_of_businesses = X.shape[0]
cluster_indices = set(self.X[:, 0].toarray().flatten())
for c_index in cluster_indices:
# indices of cluster members
members_indices = np.where(X[:, 0].toarray().flatten() == c_index)[0]
# other cluster memebers
non_members_indices = np.where(X[:, 0].toarray().flatten() != c_index)[0]
# textual features start from six
# how many times a word occurred in other clusters
denominator = (1 + np.array(self.X[non_members_indices, 6:].sum(axis=0).tolist()[0]))
idf = np.log10( total_number_of_businesses / denominator).reshape(1,-1)
self.X[members_indices, 6:] = sparse.csr_matrix(self.X[members_indices, 6:].toarray() * idf)
return self.X
def fit_transform(self, X, *_):
return self.transform(X)
def fit(self, *_):
return self
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import Pipeline, make_pipeline
pipeline = make_pipeline(ICFPenalty(), Normalizer())
# the cluster label is at the front
X_train_bow_penalized = pipeline.fit_transform(X_train_bow)
# same target classes: y_train_bow and y_test_bow
X_test_bow_penalized = pipeline.transform(X_test_bow)
# # Load Google's pre-trained Word2Vec model.
# import gensim
# from gensim.models import word2vec
# google_model = gensim.models.KeyedVectors.load_word2vec_format ('/home/tigial3535/google_word_vectors.bin.gz', binary=True)
# restaurants_vectorizer_bow2.get_feature_names()
# print(google_model.wv.most_similar(['order'], topn=30))
# google_model.wv.most_similar
Within a one-mile circle: average rating and average number of reviews.
The training data supplied to this transformer has stars, review_count, latitude, longitude, and age. In a pipeline, the distance transformer is the last step. The bag-of-words features contain the target class, so the pipeline step should handle it correctly.
X_train_bow.shape
That is 3229 textual features, 1 cluster label, and 5 dense features.
# I can use one of the X_trains for testing surrounding restaurants
print("min latitude", X_train_bow[:,3].min())
print("max latitude", X_train_bow[:,3].max())
print("min longitude", X_train_bow[:,4].min())
print("max longitude", X_train_bow[:,4].max())
X_train_sep = pd.DataFrame({'average_stars':X_train_bow[:,1].toarray().flatten(),
'review_count':X_train_bow[:,2].toarray().flatten(),
'latitude':X_train_bow[:,3].toarray().flatten(),
'longitude':X_train_bow[:,4].toarray().flatten()})
X_train_sep.head()
# A vectorized implementation of the geopy.distance.great_circle function
# Adapted by Tinsae
# reference:
# https://gist.github.com/rochacbruno/2883505

# calculate the distance between a business and all other businesses
def distance_c(many_points, one_point):
    lat1 = many_points[:, 0]
    lon1 = many_points[:, 1]
    lat2 = one_point[0]
    lon2 = one_point[1]
    # radius = 6371  # km
    radius = 3959  # miles
    dlat = np.radians(lat2 - lat1)
    dlon = np.radians(lon2 - lon1)
    a = np.sin(dlat/2) * np.sin(dlat/2) + np.cos(np.radians(lat1)) \
        * np.cos(np.radians(lat2)) * np.sin(dlon/2) * np.sin(dlon/2)
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
    d = radius * c
    return d
# for testing purposes (calculating the distance between business 1 and others)
business5 = X_train_sep[['latitude', 'longitude']].values[5:6]
business8 = X_train_sep[['latitude', 'longitude']].values[8:9]
business18 = X_train_sep[['latitude', 'longitude']].values[18:19]
business1 = X_train_sep[['latitude', 'longitude']].values[1]
print("distance from 5: ", great_circle(business5, business1).miles)
print("distance from 8: ", great_circle(business8, business1).miles)
print("distance from 18: ", great_circle(business18, business1).miles)
distance_c(X_train_sep[['latitude', 'longitude']].values[0:20], business1)
from sklearn.base import BaseEstimator, TransformerMixin
class DistanceTransformer(BaseEstimator, TransformerMixin):
    def transform(self, X, *_):
        self.nearest_average_rating = []
        self.nearest_average_num_of_reviews = []
        self.X = X
        self.X_sep = pd.DataFrame({'average_stars': X[:, 0].toarray().flatten(),
                                   'review_count': X[:, 1].toarray().flatten(),
                                   'latitude': X[:, 2].toarray().flatten(),
                                   'longitude': X[:, 3].toarray().flatten()})
        self.X_sep.apply(lambda current: self.create_distance_features(current), axis=1)
        nearest_average_rating = np.array(self.nearest_average_rating).reshape(-1, 1)
        nearest_average_num_of_reviews = np.array(self.nearest_average_num_of_reviews).reshape(-1, 1)
        # add the two features after the cluster labels
        return sparse.hstack((self.X[:, 0].astype(float), nearest_average_rating,
                              nearest_average_num_of_reviews, self.X[:, 1:].astype(float)))

    # calculate the distance between a business and all other businesses
    def distance_c(self, many_points, one_point):
        lat1 = many_points[:, 0]
        lon1 = many_points[:, 1]
        lat2 = one_point[0]
        lon2 = one_point[1]
        # radius = 6371  # km
        radius = 3959  # miles
        dlat = np.radians(lat2 - lat1)
        dlon = np.radians(lon2 - lon1)
        a = np.sin(dlat/2) * np.sin(dlat/2) + np.cos(np.radians(lat1)) \
            * np.cos(np.radians(lat2)) * np.sin(dlon/2) * np.sin(dlon/2)
        c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
        d = radius * c
        return d

    def create_distance_features(self, current):
        # distance between the current row and all other rows, using the vectorized function
        the_distances = self.distance_c(self.X_sep[['latitude', 'longitude']].values,
                                        (current.latitude, current.longitude))
        # a distance cannot be negative, and 0 indicates the same place,
        # so both are excluded
        indices_to_select = np.where((the_distances < 1) & (the_distances > 0))[0]
        # if a business within one mile is found
        if len(indices_to_select) > 0:
            self.nearest_average_rating.append(self.X_sep.iloc[indices_to_select, :].average_stars.mean())
            self.nearest_average_num_of_reviews.append(self.X_sep.iloc[indices_to_select, :].review_count.mean())
        else:
            self.nearest_average_rating.append(np.nan)
            self.nearest_average_num_of_reviews.append(np.nan)

    def fit_transform(self, X, *_):
        return self.transform(X)

    def fit(self, *_):
        return self
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import Normalizer, StandardScaler
class CustomNormalizer(BaseEstimator, TransformerMixin):
    def transform(self, X, *_):
        self.X = X
        # columns 5 and 6 (latitude and longitude) are removed;
        # cluster labels are not scaled; StandardScaler for the dense features
        # (nearest_average_rating, nearest_average_num_of_reviews, stars, review_count, age);
        # Normalizer for the text features
        return sparse.hstack((self.X[:, 0],
                              StandardScaler().fit_transform(self.X[:, [1, 2, 3, 4, 7]].toarray()),
                              Normalizer().fit_transform(self.X[:, 8:]))).tocsr()

    def fit_transform(self, X, *_):
        return self.transform(X)

    def fit(self, *_):
        return self
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import Normalizer, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import TruncatedSVD
class LSA(BaseEstimator, TransformerMixin):
    def __init__(self, dim):
        self.dim = dim

    def fit(self, X, *_):
        self.X = X.tocsr()
        self.svd = TruncatedSVD(self.dim)
        lsa = make_pipeline(self.svd, Normalizer(copy=False))
        lsa.fit(self.X[:, 6:])
        variance_explained = self.svd.explained_variance_ratio_
        total_variance = variance_explained.sum()
        print("Percent variance captured by all components:", total_variance * 100)
        return self

    def transform(self, X, *_):
        # reduce the textual features only
        return sparse.hstack((X[:, 0:6], self.svd.transform(X[:, 6:]))).tocsr()

    def fit_transform(self, X, *_):
        return self.fit(X).transform(X)
# # test code
# sc = StandardScaler()
# sc.fit_transform(X_train[:,2].toarray())
# X_train[0:5, 0:5].toarray()
# X_train_transformed.tocsr()[0:5, 0:5].toarray()
# CNR = CustomNormalizer()
# X_train_transformed = CNR.transform(X_train)
# (X_train_transformed).tocsr()[0:5, 0:5].toarray()
# this is used for testing the tranformer
# DFC = DistanceFeaturesCreator()
# X_train_transformed = DFC.transform(X_train)
# (X_train_transformed).tocsr()[0:5, 0:5].toarray()
# import time
# nearest_average_rating = []
# nearest_average_num_of_reviews = []
# start_time = time.time()
# X_train_sep2 = X_train_sep.copy(deep=True)
# def create_distance_features(current):
# # find the distance between current row and all other rows using the vectorized function
# the_distances = distance_c(X_train_sep2[['latitude', 'longitude']].values, (current.latitude, current.longitude))
# # A distance cannot be negative. 0 indicates it is the same place
# # so both shall be excluded
# indices_to_select = np.where((the_distances < 1) & (the_distances > 0))[0]
# # if a business within one mile is found
# if(len(indices_to_select) > 0):
# nearest_average_rating.append(X_train_sep2.iloc[indices_to_select, :].average_stars.mean())
# nearest_average_num_of_reviews.append(X_train_sep2.iloc[indices_to_select, :].review_count.mean())
# else:
# nearest_average_rating.append("No-Nearest")
# nearest_average_num_of_reviews.append("No-Nearest")
# X_train_sep2.apply(lambda x: create_distance_features(x), axis=1)
# print(time.time() - start_time, " seconds")
# X_train_sep2["nearest_average_rating"] = nearest_average_rating
# X_train_sep2["nearest_average_num_of_reviews"] = nearest_average_num_of_reviews
# X_train_sep2.head()
# # This is for testing
# X_train_transformed
# RNN = RemoveNoNearest()
# RNN = RemoveNoNearest()
# X_transformed_again, y_transformed = RNN.transform(X_train_transformed.tocsr(), y_train)
# X_transformed_again = RNN.transform(X_train_transformed.tocsr())
# X_train_transformed
# X_transformed_again
# X_train_transformed.tocsr()[312, 0:2].toarray()
# np.isnan(X_transformed_again.toarray()).sum()
# from sklearn.preprocessing import StandardScaler
# with warnings.catch_warnings():
# warnings.simplefilter("ignore")
# scaler = StandardScaler()
# dense_train_sel_sorted[dense_train_sel_sorted.columns.drop("is_closed")] = scaler.fit_transform(dense_train_sel_sorted[dense_train_sel_sorted.columns.drop("is_closed")])
# dense_train_sel_sorted.head(5)
from sklearn.impute import SimpleImputer
pipeline = make_pipeline(DistanceTransformer(), SimpleImputer(strategy="mean"), CustomNormalizer())
X_train_bow_fplot = pipeline.fit_transform(X_train_bow)
# create dataframe for plotting purpose
column_headers = ['nearest_average_rating', 'nearest_average_number_of_reviews', 'average_stars',
                  'review_count', 'age_of_restaurant']
df = pd.DataFrame(X_train_bow_fplot.tocsr()[:, 1:6 ].toarray(), columns=column_headers)
df.head()
# Code from https://seaborn.pydata.org/examples/many_pairwise_correlations.html
sns.set(style="white")
# Compute the correlation matrix
corr = df.corr()
# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5})
sns.despine()
sns.pairplot(df)
sns.despine()
I am using undersampled data to test logistic regression and decision tree models. Other estimators don't perform well. The pipelines transform the train and test sets separately. I tried three forms of textual features.
from sklearn.metrics import classification_report, confusion_matrix,roc_auc_score,average_precision_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
import lightgbm as lgb
# these are the first 5 features (columns 1-5); column 0 is the cluster label
first_5 = ["nearest_average_rating", "nearest_average_number_of_reviews", "average_stars", "review_count", "age_of_restaurant"]
first_5
print("total words", len(restaurants_vectorizer_bow_full.get_feature_names()) )
# these are the remaining bag-of-words features
restaurants_vectorizer_bow_full.get_feature_names()[:10]
print("total words", len(tfidf_vectorizer_train_full.get_feature_names()) )
# these are the remaining tf-idf features
tfidf_vectorizer_train_full.get_feature_names()[:10]
# combine the features
all_the_features_bow = first_5 + list(restaurants_vectorizer_bow_full.get_feature_names())
len(all_the_features_bow)
# combine the features
all_the_features_tfidf = first_5 + list(tfidf_vectorizer_train_full.get_feature_names())
len(all_the_features_tfidf)
def estimate(X_train, y_train, X_test, y_test, model_type, penalty_arg='l2', C_arg=1.0,
             max_depth_arg=50, n_estimators_arg=900, min_samples_leaf_arg=15):
    if model_type == "logistic_regression":
        model = LogisticRegression(penalty=penalty_arg, C=C_arg)
    elif model_type == "decision_tree":
        model = DecisionTreeClassifier(max_depth=10)
    elif model_type == "random_forest":
        model = RandomForestClassifier(max_depth=max_depth_arg, n_estimators=n_estimators_arg,
                                       min_samples_leaf=min_samples_leaf_arg, n_jobs=-1)
    elif model_type == "XGBoost":
        model = XGBClassifier()
    elif model_type == "lightgbm":
        lgb_train_data = lgb.Dataset(X_train, label=y_train)
        params = {}
        params['learning_rate'] = 0.003
        params['boosting_type'] = 'gbdt'
        params['objective'] = 'binary'
        params['metric'] = 'binary_error'
        params['sub_feature'] = 0.5
        params['num_leaves'] = 150
        params['min_data'] = 50
        params['max_depth'] = 30
        # train the model using LightGBM
        num_round = 1000
        model = lgb.train(params, lgb_train_data, num_round)
    if model_type == "lightgbm":
        # lgb.train returns a Booster, which predicts probabilities;
        # threshold them at 0.5 to get class labels
        y_pred = (model.predict(X_test) >= 0.5).astype(int)
        y_pred_train = (model.predict(X_train) >= 0.5).astype(int)
    else:
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        y_pred_train = model.predict(X_train)
    print('\n Confusion Matrix')
    print(pd.crosstab(y_test, y_pred, rownames=['True'], colnames=['Predicted']))
    print(classification_report(y_test, y_pred))
    print("ROC-AUC", roc_auc_score(y_test, y_pred))
    print("PR-AUC-Score", average_precision_score(y_test, y_pred))
    print("Test Accuracy: ", accuracy_score(y_test, y_pred))
    print("Train Accuracy: ", accuracy_score(y_train, y_pred_train))
    return model
def analyze_features(feature_names, coeffs=None, model_type="logistic", limit=10):
    # note: for the non-logistic branches, `model` is read from the enclosing (notebook) scope
    if model_type == "logistic":
        print("Most positive coefficients\n(contribute positively to restaurant closure)\n")
        sorted_features = np.argsort(coeffs).ravel()[::-1]
        for i in range(limit):
            print(feature_names[sorted_features[i]], end=",")
        print()
        print("\nMost negative coefficients\n(contribute negatively to restaurant closure)\n")
        sorted_features = np.argsort(coeffs).ravel()
        for i in range(limit):
            print(feature_names[sorted_features[i]], end=",")
        return
    # tree-based models expose importances instead of coefficients
    if model_type == "lightgbm":
        feature_importance = model.feature_importance()
    else:
        feature_importance = model.feature_importances_
    sorted_features = np.argsort(feature_importance)[::-1]
    for i in range(limit):
        print(feature_names[sorted_features[i]], end=",")
    # make importances relative to the maximum importance
    feature_importance = 100.0 * (feature_importance / feature_importance.max())
    # argsort sorts from smallest to largest
    sorted_idx = np.argsort(feature_importance)[-30:]
    pos = np.arange(sorted_idx.shape[0]) + .9
    # use only one column of a 1 x 2 figure
    plt.subplot(1, 2, 2)
    # pos is the position of the bars on the y axis,
    # feature_importance[sorted_idx] is the width of the bars
    plt.barh(pos, feature_importance[sorted_idx], align='center')
    plt.yticks(pos, np.array(feature_names)[sorted_idx])
    plt.xlabel('Relative Importance')
    plt.title('Variable Importance')
    plt.show()
The preprocessing was done in the previous section as follows: I split the data into train and test sets before vectorization and converted each of them to tf-idf vectors, using the same vocabulary for both. Doing a train-test split after tf-idf vectorization creates a weak model because test-set data will leak into the train set.
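The leak-free ordering can be sketched as follows (toy documents and hypothetical variable names; the real notebook uses the Yelp reviews):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

docs = ["great food great staff", "slow service cold food",
        "great place", "service was slow"]
labels = [0, 1, 0, 1]

# 1) split first ...
docs_train, docs_test, y_tr, y_te = train_test_split(
    docs, labels, test_size=0.5, random_state=42)

# 2) ... then learn the vocabulary and IDF weights from the training documents only
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(docs_train)
# 3) the test set is transformed with the train-set statistics, so nothing leaks
X_test = vectorizer.transform(docs_test)
```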
pipeline = make_pipeline(DistanceTransformer(), SimpleImputer(strategy="mean"), CustomNormalizer())
X_train_tf_idf_t = pipeline.fit_transform(X_train_tf_idf)
# apply the train-fitted pipeline to the test set (no refitting)
X_test_tf_idf_t = pipeline.transform(X_test_tf_idf)
y_train_tf_idf = y_train_tf_idf.ravel()
y_test_tf_idf = y_test_tf_idf.ravel()
# the cluster labels are untouched by the transformations
X_train_tf_idf_t[:, 0].toarray()
# logistic regression, ignore cluster labels
model = estimate(X_train_tf_idf_t[:, 1:], y_train_tf_idf, X_test_tf_idf_t[:,1:],
y_test_tf_idf, model_type="logistic_regression", penalty_arg='l2', C_arg=1.0)
analyze_features(all_the_features_tfidf, model.coef_, limit=15)
# decision tree
model = estimate(X_train_tf_idf_t[:, 1:], y_train_tf_idf, X_test_tf_idf_t[:,1:],
y_test_tf_idf, model_type="decision_tree")
analyze_features(all_the_features_tfidf, model_type="other")
In the preprocessing step, the bag-of-words vectors were created for all the data, and then the data was split into train and test sets. Unlike the tf-idf transformation, bag-of-words vectorization doesn't leak data: every business has its own reviews, and words are counted from those reviews alone. Corpus-wide information like inverse document frequency is not used.
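That property is easy to verify: with a fixed vocabulary, a document's bag-of-words counts are the same whether or not other documents are in the corpus (a toy check, not the notebook's vectorizer):

```python
from sklearn.feature_extraction.text import CountVectorizer

# fix the vocabulary so both calls use the same columns
vectorizer = CountVectorizer(vocabulary=["food", "great", "slow"])

alone = vectorizer.transform(["great food great"]).toarray()[0]
with_others = vectorizer.transform(["great food great",
                                    "slow slow service"]).toarray()[0]
# the first document's counts are identical either way,
# because counts depend only on the document itself, unlike IDF weights
```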
pipeline = make_pipeline(DistanceTransformer(), SimpleImputer(strategy="mean"), CustomNormalizer())
X_train_bow_t = pipeline.fit_transform(X_train_bow)
# apply the train-fitted pipeline to the test set (no refitting)
X_test_bow_t = pipeline.transform(X_test_bow)
y_train_bow = y_train_bow.ravel()
y_test_bow = y_test_bow.ravel()
X_train_bow[:, 6].toarray()
# logistic regression, cluster labels are not used for estimation
model = estimate(X_train_bow_t[:, 1:], y_train_bow, X_test_bow_t[:,1:],
y_test_bow, model_type="logistic_regression", penalty_arg='l2', C_arg=0.01)
analyze_features(all_the_features_bow, model.coef_, limit=15)
# decision tree, cluster labels are not used as features
model = estimate(X_train_bow_t[:,1:], y_train_bow, X_test_bow_t[:,1:],
y_test_bow, model_type="decision_tree")
analyze_features(all_the_features_bow, model_type="other", limit=15)
pipeline = make_pipeline(DistanceTransformer(), SimpleImputer(strategy="mean"), CustomNormalizer())
X_train_bow_penalized_t = pipeline.fit_transform(X_train_bow_penalized)
# apply the train-fitted pipeline to the test set (no refitting)
X_test_bow_penalized_t = pipeline.transform(X_test_bow_penalized)
# logistic regression
model = estimate(X_train_bow_penalized_t[:, 1:], y_train_bow, X_test_bow_penalized_t[:,1:],
y_test_bow, model_type="logistic_regression", penalty_arg='l2', C_arg=0.01)
analyze_features(all_the_features_bow, model.coef_, limit=15)
# decision tree
model = estimate(X_train_bow_penalized_t[:, 1:], y_train_bow, X_test_bow_penalized_t[:,1:],
y_test_bow, model_type="decision_tree")
analyze_features(all_the_features_bow, model_type="other", limit=15)
In a prior work, the maximum accuracy, around 72%, was obtained by logistic regression. In this project, I applied a spaCy lemmatizer for all models. Logistic regression still surpasses the others: it achieved around 75% accuracy in the baseline model. Hopefully the model will improve after hyperparameter tuning.
The baseline models performed better with the tf-idf features, though the bag-of-words features are not bad either. I chose the bag-of-words features because their results are interpretable and GridSearchCV is possible without data leakage. Next, I will try ensemble models.
# random forest
model = estimate(X_train_bow_t[:, 1:], y_train_bow, X_test_bow_t[:,1:],
y_test_bow, model_type="random_forest")
analyze_features(all_the_features_bow, model_type="other")
# xgboost
model = estimate(X_train_bow_t[:, 1:], y_train_bow, X_test_bow_t[:,1:],
                 y_test_bow, model_type="XGBoost")
analyze_features(all_the_features_bow, model_type="other", limit=15)
# lightgbm
model = estimate(X_train_bow_t[:, 1:], y_train_bow, X_test_bow_t[:,1:],
y_test_bow, model_type="lightgbm")
analyze_features(all_the_features_bow, model_type="lightgbm", limit=20)
# random forest
model = estimate(X_train_bow_penalized_t[:, 1:], y_train_bow, X_test_bow_penalized_t[:,1:],
y_test_bow, model_type="random_forest")
analyze_features(all_the_features_bow, model_type="other")
# # lightgbm
# model = estimate(X_train_bow_penalized_t[:, 1:], y_train_bow, X_test_bow_penalized_t[:,1:],
# y_test_bow, model_type="lightgbm")
# analyze_features(all_the_features_bow, model_type="lightgbm", limit=20)
# xgboost
model = estimate(X_train_bow_penalized_t[:, 1:], y_train_bow, X_test_bow_penalized_t[:,1:],
y_test_bow, model_type="XGBoost")
analyze_features(all_the_features_bow, model_type="other", limit=15)
pipeline = make_pipeline(DistanceTransformer(), SimpleImputer(strategy="mean"), CustomNormalizer(), LogisticRegression())
# with random split
from sklearn.model_selection import ParameterGrid, GridSearchCV
param_grid = {'logisticregression__C': [0.1, 0.2, 0.25, 0.5, 1.0, 1.25,1.5],
'logisticregression__penalty' :['l2']
}
logistic_search = GridSearchCV(pipeline, param_grid, iid=False, cv=3,
return_train_score=True, n_jobs=-1)
logistic_search.fit(X_train_bow[:, 1:], y_train_bow)
print("Best parameter (CV score=%0.3f):" % logistic_search.best_score_)
print(logistic_search.best_params_)
logistic_search.cv_results_
The bag of words contains more than 3000 features. Let me see how they perform alone.
X_train_bow_t[:,6].toarray()
model = estimate(X_train_bow_t[:, 6:], y_train_bow, X_test_bow_t[:,6:],
y_test_bow, model_type="logistic_regression", penalty_arg='l2', C_arg=0.01)
analyze_features(all_the_features_bow[5:], model.coef_, limit=15)
# xgboost
model = estimate(X_train_bow_t[:,6:], y_train_bow, X_test_bow_t[:,6:],
y_test_bow, model_type="XGBoost")
analyze_features(all_the_features_bow[5:], model_type="other", limit=15)
The words are similar to the ones obtained before. More research is required to know why these words are good predictors.
Let us see if the bag-of-words model can be improved by applying Latent Semantic Analysis. The LSA class fits LSA on the training set and uses it to transform the test set.
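The core of that class is the standard TruncatedSVD pattern: learn the components on the training portion and reuse them for the test portion. A minimal sketch with random stand-in data (made-up dimensions):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.RandomState(0)
train_text = rng.rand(100, 40)   # stand-in for the sparse word columns
test_text = rng.rand(20, 40)

svd = TruncatedSVD(n_components=10, random_state=0)
train_lsa = svd.fit_transform(train_text)  # components learned on train only
test_lsa = svd.transform(test_text)        # test projected onto the same components
```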
pipeline = make_pipeline(DistanceTransformer(), SimpleImputer(strategy="mean"), CustomNormalizer(), LSA(dim=500))
X_train_bow_lsa_t = pipeline.fit_transform(X_train_bow)
X_test_bow_lsa_t = pipeline.transform(X_test_bow)
X_train_bow_lsa_t.shape
X_test_bow_lsa_t
# logistic regression
model = estimate(X_train_bow_lsa_t.tocsr()[:,6:], y_train_bow, X_test_bow_lsa_t.tocsr()[:,6:],
y_test_bow, model_type="logistic_regression", penalty_arg='l2', C_arg=0.01)
analyze_features(["svd_feature_"+ str(i) for i in range(500)], model.coef_, limit=15)
LSA increases the test-set performance of logistic regression with just 500 features, and overfitting is reduced as well. Next I will check the full model using the 500 bag-of-words components.
# logistic regression
model = estimate(X_train_bow_lsa_t[:,1:], y_train_bow, X_test_bow_lsa_t.tocsr()[:,1:],
y_test_bow, model_type="logistic_regression", penalty_arg='l2', C_arg=0.01)
new_set_of_features = first_5 + ["svd_feature_"+ str(i) for i in range(500)]
analyze_features(new_set_of_features, model.coef_, limit=15)
The full model didn't improve by much. Creating custom ensembles might be a good idea.
# XGBoost
model = estimate(X_train_bow_lsa_t.tocsr()[:,6:], y_train_bow, X_test_bow_lsa_t.tocsr()[:,6:],
y_test_bow, model_type="XGBoost")
analyze_features(["svd_feature_"+ str(i) for i in range(500)], model_type="other", limit=15)
There is a huge improvement for XGBoost. Let me check the full model
# XGBoost
model = estimate(X_train_bow_lsa_t.tocsr()[:,1:], y_train_bow, X_test_bow_lsa_t.tocsr()[:,1:],
y_test_bow, model_type="XGBoost")
new_set_of_features = first_5 + ["svd_feature_"+ str(i) for i in range(500)]
analyze_features(new_set_of_features, model_type="other", limit=15)
XGBoost improved from 50% to 64%.
# # lightgbm
# model = estimate(X_train_bow_lsa_t[:, 1:], y_train_bow, X_test_bow_t[:,1:],
# y_test_bow, model_type="lightgbm")
# new_set_of_features = first_5 + ["svd_feature_"+ str(i) for i in range(500)]
# analyze_features(new_set_of_features, model_type="lightgbm", limit=20)
LightGBM is still overfitted, but its test score is slightly higher than XGBoost's.
The dense features include surrounding performance; I call them distance features: nearest_average_rating and nearest_average_num_of_reviews. The others are average_stars, review_count, and the age of the restaurant.
# logistic regression
model = estimate(X_train_bow_lsa_t.tocsr()[:,1:6], y_train_bow, X_test_bow_lsa_t.tocsr()[:,1:6],
y_test_bow, model_type="logistic_regression", penalty_arg='l2', C_arg=0.01)
analyze_features(first_5, model.coef_, limit=5)
print("\n\nThe Coefficients\n")
print(first_5)
print(model.coef_.ravel())
The age of a restaurant contributes the most to restaurant closure. It is interesting that the nearest average number of reviews contributes to closure, while a restaurant's own review count helps it survive.
# XGBoost
model = estimate(X_train_bow_lsa_t.tocsr()[:,1:6], y_train_bow, X_test_bow_lsa_t.tocsr()[:,1:6],
y_test_bow, model_type="XGBoost")
analyze_features(first_5, model_type="other", limit=5)
The dense features are better predictors of restaurant closure, achieving around 63% accuracy on the test set. After applying LSA, the bag-of-words model showed improvement: logistic regression improved from 53% to 64% accuracy. When the two are combined, the accuracy didn't improve by much, even after the improvement of the bag-of-words features.
Next task: tune the hyperparameters separately, then combine the models.
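One possible way to combine separately tuned models is a soft-voting ensemble. The following is a sketch on synthetic data with placeholder hyperparameters, not a claim about what the final project does:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# plug in the hyperparameters found by the separate grid searches
ensemble = VotingClassifier(
    estimators=[("lr", LogisticRegression(C=0.1, max_iter=1000)),
                ("rf", RandomForestClassifier(n_estimators=100, random_state=0))],
    voting="soft")  # average the predicted class probabilities
ensemble.fit(X_tr, y_tr)
proba = ensemble.predict_proba(X_te)
```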
# logistic regression with tfidf features
model = estimate(X_train_tf_idf_t[:, 6:], y_train_tf_idf, X_test_tf_idf_t[:,6:],
y_test_tf_idf, model_type="logistic_regression", penalty_arg='l2', C_arg=0.1)
analyze_features(all_the_features_tfidf[5:], model.coef_, limit=15)
It is amazing to see the model with only tf-idf features return meaningful words when trained separately.
# logistic regression full model
model = estimate(X_train_tf_idf_t[:, 1:], y_train_tf_idf, X_test_tf_idf_t[:,1:],
y_test_tf_idf, model_type="logistic_regression", penalty_arg='l2', C_arg=0.1)
analyze_features(all_the_features_tfidf, model.coef_, limit=15)
The tf-idf features combined with the other features produced the best model of all.
# XGBoost with tfidf features
model = estimate(X_train_tf_idf_t.tocsr()[:,6:], y_train_tf_idf, X_test_tf_idf_t.tocsr()[:,6:],
y_test_tf_idf, model_type="XGBoost")
analyze_features(all_the_features_tfidf[5:], model_type="other", limit=15)
# trying LSA
pipeline = make_pipeline(DistanceTransformer(), SimpleImputer(strategy="mean"), CustomNormalizer(), LSA(dim=500))
X_train_tf_idf_lsa_t = pipeline.fit_transform(X_train_tf_idf)
X_test_tf_idf_lsa_t = pipeline.transform(X_test_tf_idf)
X_train_tf_idf_lsa_t.shape
# logistic regression with lsa on tfidf features
model = estimate(X_train_tf_idf_lsa_t.tocsr()[:,6:], y_train_tf_idf, X_test_tf_idf_lsa_t.tocsr()[:,6:],
y_test_tf_idf, model_type="logistic_regression", penalty_arg='l2', C_arg=0.01)
analyze_features(["svd_feature_"+ str(i) for i in range(500)], model.coef_, limit=15)
# logistic regression full model after lsa is applied to textual features
model = estimate(X_train_tf_idf_lsa_t[:,1:], y_train_tf_idf, X_test_tf_idf_lsa_t.tocsr()[:,1:],
y_test_tf_idf, model_type="logistic_regression", penalty_arg='l2', C_arg=0.01)
new_set_of_features = first_5 + ["svd_feature_"+ str(i) for i in range(500)]
analyze_features(new_set_of_features, model.coef_, limit=15)
# XGBoost with lsa applied to tfidf features
model = estimate(X_train_tf_idf_lsa_t.tocsr()[:,6:], y_train_tf_idf, X_test_tf_idf_lsa_t.tocsr()[:,6:],
y_test_tf_idf, model_type="XGBoost")
analyze_features(["svd_feature_"+ str(i) for i in range(500)], model_type="other", limit=15)
# XGBoost full model
model = estimate(X_train_tf_idf_lsa_t.tocsr()[:,1:], y_train_tf_idf, X_test_tf_idf_lsa_t.tocsr()[:,1:],
y_test_tf_idf, model_type="XGBoost")
new_set_of_features = first_5 + ["svd_feature_"+ str(i) for i in range(500)]
analyze_features(new_set_of_features, model_type="other", limit=15)
XGBoost's accuracy with SVD is slightly higher than logistic regression's without SVD.
pipeline.named_steps['lsa'].svd.components_.shape
# component 3
top_words = np.argsort(pipeline.named_steps['lsa'].svd.components_[3])[::-1]
for w in top_words[:15]:
    print(all_the_features_tfidf[w])
# component 32
top_words = np.argsort(pipeline.named_steps['lsa'].svd.components_[32])[::-1]
for w in top_words[:15]:
    print(all_the_features_tfidf[w])
Health inspection results seem highly predictive of restaurant closure. I tried to obtain data about the 11 metropolitan areas used in this project, but most of those areas or cities don't have complete data. Southern Nevada has complete data, yet only 1000 of its restaurants match the Yelp dataset, while I used more than 30000 restaurants in this project. Thus, integrating health inspection data is not feasible here; I could do a separate project on those 1000 restaurants.
# restaurant_inspection = pd.read_csv("../../../Data & Script/nevada_restaurant_inspection.csv").drop(["State", "Serial Number", "Employee ID", "Permit Number", "Inspection Time"], axis=1)
# restaurant_inspection.head(3)
# restaurant_inspection.columns
Technical Coaching APP [Today at 2:04 PM] [Open] Ticket :hash: 17246 (edited) Requested by @Tinsae G. Alemayehu There is latitude and longitude information. Is there anyway to split train and test stratified by geographical location? :rocket: API | Assignee: [Group] Data Science | Type: incident 47 replies
Technical Coaching APP [2 hours ago] @Tinsae, You've opened a Technical Coaching help ticket!
A technical coach will be with you when they're able. Review our schedule: http://bit.ly/tc-schedule
To make things move quicker, please include the following if appropriate:
• Checkpoint Link • GitHub link to the branch you're on
• Make sure code is up-to-date • What’s got you stuck? Be as specific as you can! • What have you tried so far?
Tiago [2 hours ago] Hey, Tinsae! Long time no see
Tinsae [2 hours ago] Hi Tiago. Glad you are here to help
Tiago [2 hours ago] so, let me see if I understand, you kinda want to split your points into, for example, north/south boundaries?
Tiago [2 hours ago] like having a line at the middle of them?
Tinsae [2 hours ago] I want both train and test sets to include businesses from all over the place
Tiago [2 hours ago] okay. Just trying to understand what you mean by "stratifying"
Tiago [2 hours ago] like, how do you want to separate the data?
Tinsae [2 hours ago] likethis.png
Tiago [2 hours ago] like an abstract art painting
Tinsae [2 hours ago] :slightly_smiling_face:
Tiago [2 hours ago] I can't really help you do that, I suck at art and did a degree in Mathematics, because that was as far as I could go
Tinsae [2 hours ago] The blue dots are on the trainset and the red dots are on the test set. I assumed 2D distance
Tiago [2 hours ago] but, if I understand correctly, you want to take random samplings space-wise. Is that it?
Tinsae [2 hours ago] Yes
Tiago [2 hours ago] that's a good doubt, and will have to research again how to do it
Tiago [2 hours ago] I like your questions, always things that are interesting
Tiago [2 hours ago] I'm trying to check if there is some already-made algorithm for it
Tinsae [2 hours ago] take your time if you have time
Tiago [2 hours ago] (if I were to do such a thing by bare hand, my first instinct is to index all the points using an R-Tree and then get one point from each of the clusters the R-Tree generated)
Tiago [2 hours ago] ((buuut I guess there are better ways to do that))
Tiago [2 hours ago] can you send me a subset of the data?
Tinsae [2 hours ago] you can use yelp businesses dataset
Tinsae [2 hours ago] let me see if I can find the data
Tiago [2 hours ago] oooooh, okay
Tiago [2 hours ago] no, no problems
Tiago [2 hours ago] and... you want to stratify your data space-wise to do what with it?
Tinsae [2 hours ago] finding nearest neighbors for each business
Tinsae [2 hours ago] and create aggregated features like nearest average stars (edited)
Tinsae [2 hours ago] It was part of my supervised capstone
Tinsae [2 hours ago] Rtrees look like KD trees. I am reading about Rtrees
Tiago [2 hours ago] please take a look into this: https://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html
Tinsae [1 hour ago] Those are clustering algorithms. Do I put cluster labels on the data and choose data from each cluster proportionally
Tiago [1 hour ago] I was considering two ways of doing what you asked:
1) Just use geopandas, for each business get all businesses within a certain distance of it, and create an aggregate score. The problem with it is: the distribution of stores is not uniform.
2) Use HDBSCAN to make clusters (I might like that algorithm a lot), setting min_cluster_size for each cluster
Tinsae [1 hour ago] I did exactly the same as the first method with geopy. The problem is not finding neighbors within 1 mile for many businesses, because the random split may select businesses that are far apart. (edited)
Tinsae [1 hour ago] especially for the test set because it is limited in size (edited)
Tinsae [1 hour ago] I am checking HDBSCAN
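The cluster-then-sample idea from the messages above can be sketched like this, using KMeans as a stand-in for HDBSCAN and stratifying a random split by cluster label; the coordinates are synthetic and the cluster count is arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

# synthetic business coordinates (lat, lon); real data would come from the Yelp dataset
rng = np.random.default_rng(0)
coords = rng.uniform(low=[33.0, -112.5], high=[34.0, -111.5], size=(1000, 2))

# label each business with a spatial cluster, then stratify the split by cluster label,
# so every spatial region is represented proportionally in both train and test
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(coords)
train_idx, test_idx = train_test_split(
    np.arange(len(coords)), test_size=0.2, stratify=labels, random_state=0)
```

Because every cluster contributes points to both sets, test businesses are far more likely to have train-set neighbors within a small radius than under a purely random split.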
Tiago [1 hour ago] (just as a sidenote, geopandas' distance function should be faster than geopy's, because geopandas does indexing
Tiago [1 hour ago] Or should do scanning, at least)
Tinsae [45 minutes ago] is it vectorized
Tinsae [44 minutes ago] I forgot to tell you that I adapted geopy's method to make it vectorized
Tiago [43 minutes ago] me likey! (seriously)
Tiago [43 minutes ago] how did you do that?
Tinsae [37 minutes ago] I copied the code and manipulated it using numpy. Not that difficult
Tiago [32 minutes ago] It's just that, if I recall how numpy works, the vectorization is made one layer below, at the C (or C++, I always forget) code
Tiago [31 minutes ago] (I might be fantastically wrong, though)
Tinsae [26 minutes ago] I don't know how it works. There is some C behind the scenes.
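A vectorized great-circle distance along the lines Tinsae describes might look like this: a numpy adaptation of the haversine formula, where the trig functions operate on whole arrays at once instead of one coordinate pair at a time. The function name and radius constant are illustrative, not geopy's actual code.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; arguments may be scalars or numpy arrays (degrees)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# distance from one business to many candidates in a single vectorized call
dists = haversine_km(36.1, -115.2, np.array([36.1, 36.2]), np.array([-115.2, -115.3]))
```

Looping over geopy's `great_circle` per pair does the same math in pure Python; broadcasting it this way lets numpy push the loop down into C.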
Technical Coaching APP [Feb 21st at 10:05 PM] [Solved] Ticket :hash: 17173 (edited) Requested by @Ryan Hohenhausen In Unit 4.4.3 the latent semantic analysis coded example is: svd= TruncatedSVD(130) lsa = make_pipeline(svd, Normalizer(copy=False :rocket: API | Assignee: David Reed | Type: incident 10 replies
Technical Coaching APP [1 day ago] @Ryan Hohenhausen, You've opened a Technical Coaching help ticket!
A technical coach will be with you when they're able. Review our schedule: http://bit.ly/tc-schedule
To make things move quicker, please include the following if appropriate:
• Checkpoint Link • GitHub link to the branch you're on
• Make sure code is up-to-date • What’s got you stuck? Be as specific as you can! • What have you tried so far?
Tinsae [1 day ago] I want a more experienced person to answer these questions as well.
Question 1
SVD works directly on the data matrix, as opposed to the covariance matrix. I guess standardizing data before PCA is related to PCA's use of the covariance matrix.
Question 2
SVD accepts sparse matrices, and it works without converting them to dense. The tf-idf vectorizer creates a sparse matrix. Standardizing before SVD would destroy the sparsity of the matrix (centering turns most zeros into nonzeros). The resulting dense matrix would require a lot of memory and the kernel may die. (edited)
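A minimal sketch of that point: TruncatedSVD consumes the sparse tf-idf matrix directly, with no densifying or centering step in between (toy documents for illustration).

```python
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["good food bad service", "bad food", "good service good food"]

# tf-idf produces a scipy sparse matrix; TruncatedSVD accepts it as-is
X = TfidfVectorizer().fit_transform(docs)
lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)
```

Standardizing `X` first (as is common before PCA) would require subtracting the column means, which densifies the matrix; on a real review corpus that dense matrix is what blows up the memory.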
Ryan Hohenhausen [1 day ago] Thank you. That does make sense for why we don't standardize before SVD. I still am unsure why the example normalizes after SVD as opposed to doing nothing to the data. Would there be anything wrong with skipping the pipeline all together like this: svd= TruncatedSVD(130) X_train_lsa = svd.fit_transform(X_train_tfidf) (edited)
Tinsae [1 day ago] We normalize after to make the document vectors unit vectors. It removes the effect of long paragraphs.
Tinsae [1 day ago] Let me give a simple example with just word counts: paragraph1 is 100 words, paragraph2 is 10 words; paragraph1 has good: 20 times, bad: 30 times, ...; paragraph2 has good: 2 times, bad: 3 times, ...
The vectors (20, 30) and (2, 3) show a similar distribution, but the large document will be more influential if we don't normalize the vectors
To normalize, simply divide by the magnitude of the vector (aka the L2 norm)
Tinsae [1 day ago]
vec1 = np.array([20,30])
vec2 = np.array([2,3])
print(vec1/np.linalg.norm([20,30]))
print(vec2/np.linalg.norm([2,3]))
The above will give you
[0.5547002 0.83205029]
[0.5547002 0.83205029] (edited)
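The same normalization can be done with scikit-learn's Normalizer, which is the transformer the checkpoint's pipeline applies after TruncatedSVD:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# the two count vectors from the example above, as rows of a matrix
X = np.array([[20.0, 30.0], [2.0, 3.0]])

# L2-normalize each row to a unit vector, removing the document-length effect
X_unit = Normalizer(norm='l2').fit_transform(X)
```

Both rows come out as the same unit vector, matching the `np.linalg.norm` computation above.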
Ryan Hohenhausen [1 day ago] Okay that mostly clears it up.
Your example normalizes term frequency vectors. This problem normalizes SVD transformed tfidf vectors, but the same concept should still apply. Let me know if I'm wrong here.
Would there ever be a time when we would prefer not to normalize? Like in your example, could the absolute frequency of the terms potentially be more important than the relative frequencies? Or would this just be better handled by making paragraph length and/or vector length a feature on its own?
Tinsae [1 day ago] One downside: when you normalize, a short tweet with little information is treated the same as a long Wikipedia article.
I cannot answer your question definitively. You may try engineering different sets of features and see how they perform.
Ryan Hohenhausen [1 day ago] Okay will do. Thank you! Your answers have been helpful.
David Reed (Agent) APP [1 day ago] I'm closing this one out.
[Solved] Ticket :hash: 17143 (edited) Requested by @Tinsae G. Alemayehu How big a corpus is needed to train a word2vec model? I have 5B user reviews. Is that good enough? The size of the Google News corpus is 100B. Does :rocket: API | Assignee: David Reed | Type: incident
68 replies
Technical Coaching APP [2 days ago] @Tinsae, You've opened a Technical Coaching help ticket!
Tiago [2 days ago] Hey, Tinsae. You have 5 billion user reviews, like, actual reviews? Wow. Where did you get that?
Tiago [2 days ago] And let me take a look at the Google News Corpus.
Tinsae [2 days ago] Sorry I meant 5 million :slightly_smiling_face:
Tiago [2 days ago] But unfortunately I have no specific answer for you, I don't know what would be a good size.
Tinsae [2 days ago] you may check this https://github.com/3Top/word2vec-api
Tiago [2 days ago] thanks!
Tinsae [2 days ago] The size of google news corpus is 100B. Does that mean 100B words?
Tiago [2 days ago] I am trying to understand what is that 100B. Because the vocabulary size is 3 million
Tiago [2 days ago] (and anyway, where did you get the 5 million user reviews? )
Tiago [2 days ago] (not pointing fingers, just "oh, cool, I might do some cool stuff with that")
Tinsae [2 days ago] Yelp reviews from kaggle
Tiago [2 days ago] oh, okay
Tiago [2 days ago] another question: why are you using that word2vec-api?
Tinsae [2 days ago] I am not using it
Tiago [2 days ago] oooooooh, okay
Tinsae [2 days ago] The link is given in the course to check out pretrained word2vec vectors
Tiago [2 days ago] FOUND IT: about 100 billion words
Tiago [2 days ago] https://code.google.com/archive/p/word2vec/
Tiago [2 days ago] And so, I ask you, how many words do all of your reviews have?
Tinsae [2 days ago] I don't know the actual number. I created a bag of words with min_df=0.001 and obtained 3000 words out of 1M reviews.
Tiago [2 days ago] which function did you use for the bag of words?
Tinsae [2 days ago] CountVectorizer
Tiago [2 days ago] okay
Tiago [2 days ago] and, do you have a link to the raw data?
Tiago [2 days ago] or to the kaggle page
Tinsae [2 days ago] It will take more than 2 hours to count the words. I thought we could guess the word count.
https://www.kaggle.com/yelp-dataset/yelp-dataset/kernels
Tiago [2 days ago] Are you doing the count single-threaded? Only one processor core?
Tinsae [2 days ago] njobs=-1
Tiago [2 days ago] where do you use that param njobs?
Tinsae [2 days ago] I went back to the code and found that I didn't use CountVectorizer in parallelized form. It doesn't have an n_jobs kwarg. (edited)
Tinsae [2 days ago] But the virtual machine I used has 8 cores with 30GB ram
Tiago [2 days ago] another way to check if it uses parallelization under the core is to run the thing and monitor CPU Usage (with something like htop)
Tiago [2 days ago] (I am saying this, because a looot of times I found something was parallelized by default, got a beefy VM and saw only one processor being used)
Tinsae [2 days ago] The cpu utilization never crossed 50% (edited)
Tiago [2 days ago] what did you use to check CPU Usage?
Tinsae [2 days ago] GCP compute engine has a monitor page for every VM
Tiago [2 days ago] oh, okay
Tiago [2 days ago] well, anyway, if it wasn't bigger than 50%, definitely not parallel
Tinsae [2 days ago] monitoring.png
Tiago [2 days ago] Well, I see two ways forward here:
1) Just train the model and let's see what happens
2) I go to Google Colab and do some parallel code to find out how many words that corpus has
Tiago [2 days ago] I am willing to do the second route because not many people are here now. But I want to finish my avocado first.
Tinsae [2 days ago] :avocado: ok.
Tinsae [2 days ago] Can you download and upload 5GB data easily? It is large json file. (edited)
Tiago [2 days ago] my idea is to download from kaggle straight to google colab's notebook
Tinsae [2 days ago] k
Tiago [2 days ago] btw, if you have any direct links just laying around, now is the time to share
Tinsae [1 day ago]
```
# import necessary libraries
from ftplib import FTP

# log in to the FTP server
server = "##.##.###.##"
username = "**@tinsaealemayehu.com"
password = "**"
ftp = FTP(server)
ftp.login(user=username, passwd=password)

# download the reviews file
rvfile = open("yelp_academic_dataset_review.json", "wb")
ftp.retrbinary('RETR yelp_academic_dataset_review.json', rvfile.write)
rvfile.close()
```
(edited)
Tinsae [1 day ago] You can use the above code. I sent you the server, username, and password through PM (edited)
Greg [1 day ago] I have trained doc2vec models on considerably less text, and those are just fancy word2vec models, so my 2cents are to go for it.
Tinsae [1 day ago] Thanks @Greg. Just for fun, I searched "2cents" using the Google News vectors and found a bunch of rock bands: ('Sworn_Enemy', 0.6559710502624512), ('Hinder_Papa_Roach', 0.6507534980773926), ('KILL_HANNAH', 0.6474124193191528), ('Singer_Jacoby_Shaddix', 0.645458459854126)
Tiago [1 day ago] Thanks, Greg!
Tiago [1 day ago] But I think I'll do the useless coding now just for the sake of fun
Tiago [1 day ago] Aaaand the results are just in: 6685900 words
Tiago [1 day ago] so, almost 7 million words
Tiago [1 day ago] in total
Tiago [1 day ago] now, to count uniques
Tinsae [1 day ago] Nice!! waiting for the unique counts
Tiago [1 day ago] Aaand the count just used all the RAM!
Tinsae [1 day ago] You spent enough time on it. If there is a chance, could you share your code? I would try it on my 56GB RAM virtual machine. (edited)
Tiago [1 day ago]
```
import pandas as pd
import dask.dataframe as dd

# download and unpack the dataset
!wget -c "https://storage.googleapis.com/kaggle-datasets/10100/277695/yelp-dataset.zip?GoogleAccessId=web-data@kaggle-161607.iam.gserviceaccount.com&Expires=1551046405&Signature=p%2BBda9sOLceu8EceqHAhrOPKgraMDDfZ8%2FqiW1z8DygzgoJaUzBW5xL8MWtD5z55OVJWe61Zv9qpdVeHCXcZ0r4mpV7dezhwA2Mmus2OyJqAhhWoXxPSieNB36fHaFmrm6xu%2FOEp5R1TJ2s72vWKU7tnOJS9%2BFBYnvOuV6cPN4lrpVVTrixOjCLv8QQWndfGR7V1Wcz2MqLaqAVFWuN94WU6ycz%2BlcSvdgqPplNHmmzcT%2BBvepxcuTheWMPuHusGUPJ1FHHQ4BEv8RgrhtTvXtV6jxupyw69eMHWpdN0J4bWHACc2%2BGthxgK00Tplt%2FrINAFiPATMf7dpciyNKigAA%3D%3D" -O yelp.zip
!unzip yelp.zip

# lazily read the reviews and turn the text column into a dask bag
ddf = dd.read_json('yelp_academic_dataset_review.json', blocksize=2**28)
textao = ddf.text.to_bag()  # note: one item per review text, not per word

size_all_itens = textao.count().compute()       # counts items in the bag
distinct_items = textao.distinct().compute()
```
Tiago [1 day ago] soo.... apparently, only 4388 words
Tiago [1 day ago] that seems too little
Tiago [1 day ago] aaaand my code is broken
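For reference, the code above counts whole review texts, not words; a variant that splits each review into tokens before counting might look like this, shown on a toy stand-in for `ddf.text.to_bag()`:

```python
import dask.bag as db

# toy stand-in for the bag of review texts: one item per review
reviews = db.from_sequence([
    "good food great service",
    "bad food",
    "good food again",
])

words = reviews.map(str.split).flatten()           # one item per word
total_words = words.count().compute()              # total token count
unique_words = words.distinct().count().compute()  # vocabulary size
```

On the real dataset the split/flatten step is what turns the bag of ~6.7 million reviews into a bag of individual words before counting.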
Tiago [1 day ago] anyway, going to close the ticket because the issue is solved (I guess)
Tinsae [1 day ago] Thank you very much. @Tiago. We had a very productive session. I learned a lot of new things (edited)
Tiago [1 day ago] thanks so much! Glad to hear it helped
David Reed (Agent) APP [1 day ago] I'm closing.
Crystal [2:01 PM] For the U4 capstone, I have data I performed tf-idf on and then SVD to reduce the number of features. When I do a random forest classifier on it and look at the feature importances, is there a way to map those feature importances back to the original terms? So far I have:
```
importances = rfc.feature_importances_
indices = np.argsort(importances)[::-1]
```
(this gives the indices of important features in the SVD feature space)
I know there is an svd.inverse_transform() and a vectorizer.inverse_transform(), I'm just having trouble figuring out what I would apply those to.
Thanks! (edited)
Tinsae [10 hours ago] Hi Crystal. SVD-transformed features don't have names, right? We just call them feature0, feature1, ... Random forest may give feature importances like this: feature3, feature1, feature4
What if you do argsort on each one to see which original features dominated?
Tinsae [10 hours ago] I think the above will not solve the issue. Doing argsort on each one will give which observations dominated.
Tinsae [10 hours ago] Assume feature3 is important. It is a projection from component3. What about doing argsort on svd.components_[3]? It shows which original features dominated component3
Crystal [10 hours ago]
```
svd_indices = np.argsort(svd2.components_)
words = list(df_tfidf2.columns)

features = []
for index in svd_indices:
    feature = []
    # This range is how many words I want
    for i in range(5):
        for counter, value in enumerate(index):
            if value == i:
                feature.append(words[counter])
    features.append(feature)
```
Crystal [10 hours ago] not elegant, but I think this might be doing it
Crystal [10 hours ago] I think svd.components_ tells the relative importance of each component, so then if I argsort those, it gives the ranking of the words in terms of importance
Crystal [10 hours ago] so then when I use the indices from the rfc feature_importances, it seems to be working: Feature ranking:
Tinsae [10 hours ago] My previous mentor didn't want to comment on it. According to my references, it seems right. I assumed more variance means more importance.
I did a similar thing using PCA:
```
loading_scores = pd.Series(pca.components_[0], index=genes)
sorted_loading_scores = loading_scores.abs().sort_values(ascending=False)
sorted_loading_scores[0:10].index
```
Index(['gene49', 'gene20', 'gene42', 'gene70', 'gene67', 'gene63', 'gene100', 'gene98', 'gene44', 'gene73'], dtype='object')
Tinsae [10 hours ago] @mguner may comment on this (edited)
Crystal [10 hours ago] thanks
Mike Swirsky (Agent) APP [10 hours ago] Take a peek into Unit 6 Lesson 4 - advanced NLP. There are some convenience functions in section 5 for finding the top words associated with each topic AKA component of your topic model AKA latent semantic analysis.
Tinsae [10 hours ago] Never thought about this: "... is not the best technique because of its use of negative loadings."
Mike Swirsky (Agent) APP [9 hours ago] Negative loadings aren't always bad. It just means that if a word occurs in a document, it's unlikely that another topic applies to it. It can be meaningful, it's just not as obvious to interpret.
mguner [9 hours ago] Hi @Tinsae and @Crystal and everyone else. I don't really know the answer to your question, but I thought you might find the following link helpful if you haven't seen it: https://stats.stackexchange.com/questions/134282/relationship-between-svd-and-pca-how-to-use-svd-to-perform-pca Also, I looked a little bit at the scikit-learn documentation but didn't see a plain SVD. Did you use TruncatedSVD or something else?
Tinsae [9 hours ago] Great!!
Mike Swirsky (Agent) APP [9 hours ago] Yeah we are talking about using truncatedSVD as part of an LSA process.
mguner [9 hours ago] Hi @Mike Swirsky, thanks a lot for your response. And @Crystal, back to your original question: I feel like you cannot find a one-to-one relationship between the original features and the new features. But you can look at feature importance, as I guess they look at the inner products of the original feature columns and the new feature columns. Especially in TruncatedSVD you reduce the dimensionality, so you lose information. Finally, I think inverse_transform just reverses the SVD, but again it mashes the features together during the procedure to return the original X. I hope this helps.